loss function name consistency in gradient boosting #3481
It would be nice if the `loss` option in gradient boosting could be more consistent with the one in SGD. Rather than deprecating names in gradient boosting, I suggest adding aliases.
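To make the suggestion concrete, here is a minimal sketch of the kind of alias resolution I have in mind (the target names below are only illustrative, not a final decision on spelling):

```python
# Illustrative only: map legacy gradient boosting loss names to names
# shared with SGD; the canonical spellings are not decided here.
LOSS_ALIASES = {
    "lad": "absolute",  # least absolute deviation -> absolute loss
    "ls": "squared",    # least squares -> squared loss
}

def resolve_loss_name(name):
    """Return the canonical loss name, accepting legacy aliases."""
    return LOSS_ALIASES.get(name, name)

print(resolve_loss_name("lad"))    # -> "absolute"
print(resolve_loss_name("huber"))  # unchanged -> "huber"
```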
Maybe something for @dsullivan7.
Yes, I was just thinking this actually. I was hoping to extract the loss functions from SGD, so perhaps they can be shared between sgd_fast.pyx and _gradient_boosting.pyx? I'm not too comfortable with gradient_boosting but I'll take a look.
It's more in terms of API here.
The underlying methods are completely different so I don't think we need to share code. It's just a matter of unifying the names, e.g. lad -> absolute. The absolute loss is missing from SGD right now, but it can be implemented by setting epsilon=0 in the epsilon-insensitive loss.
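For illustration, a quick numerical check of that equivalence (plain NumPy, not the actual sgd_fast.pyx implementation):

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    # max(0, |y - p| - epsilon), the epsilon-insensitive regression loss
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, -2.0, 0.5])
y_pred = np.array([0.3, -1.0, 2.0])

# With epsilon=0 the epsilon-insensitive loss reduces to the absolute loss |y - p|.
print(np.allclose(epsilon_insensitive(y_true, y_pred, epsilon=0.0),
                  np.abs(y_true - y_pred)))  # True
```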
Conversely, I think it should also be possible to add the squared_hinge and modified_huber losses to gradient boosting. The hinge loss is not differentiable so it cannot be added.
Ok, sounds good, I'll take a crack at making the aliases then. I'll also check in on possibly adding squared_hinge and modified_huber. Is there a reason that the underlying methods are completely different? I haven't looked at it so I don't know.
The elements of the stochastic (sub-)gradient in SGD are with respect to the feature coefficients coef[j], so the gradient is n_features dimensional. The elements of the gradient in gradient boosting are with respect to the predictions y_pred[i], so the gradient is n_samples dimensional. In addition, gradient boosting needs a method to update the underlying trees.
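To make the difference concrete, a rough NumPy sketch for the squared loss (just an illustration of the shapes involved, not how either implementation is written):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 5, 3
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

# SGD view: gradient of 0.5 * sum_i (y_i - x_i . coef)^2 with respect to coef,
# an n_features-dimensional vector.
coef = np.zeros(n_features)
grad_coef = -X.T.dot(y - X.dot(coef))
print(grad_coef.shape)  # (3,) == (n_features,)

# Gradient boosting view: gradient of 0.5 * sum_i (y_i - y_pred_i)^2 with
# respect to the predictions themselves, an n_samples-dimensional vector;
# the next tree is fit to its negation (the pseudo-residuals).
y_pred = np.zeros(n_samples)
grad_pred = -(y - y_pred)
print(grad_pred.shape)  # (5,) == (n_samples,)
```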
Do we want to fix |
Yikes, yes, I'll take a look at that too.
+1e6 too. It was not clear in the SO answer, but the reason they're called L1 and L2 losses is the constrained formulation of the soft-margin SVM: the sum of the slack variables is penalized linearly in the L1 case and quadratically in the L2 case.
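For reference, the constrained formulations in question (standard textbook form, written out here for clarity):

```latex
% L1-loss (standard) soft-margin SVM: slacks penalized linearly
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0

% L2-loss SVM: same constraints, slacks penalized quadratically
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2
```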
On a side note: hinge loss is differentiable... see Charlie Tang's paper. It might not be worth the complexity to implement right now, but I think it is possible and it worked well for the tasks I tried it on (neural net image recognition).
As a side note to the side note: Hinton mentioned something about LeCun having done max-margin neural nets in his Coursera course, and I gather he meant optimizing for hinge loss. This would have been ~two decades ago.
@kastnerkyle Just to clarify, the hinge loss is only differentiable in the squared / l2 case. I'd love to add this loss in gradient boosting. I'd expect it to use fewer trees than the log loss.
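A rough sketch of what the squared hinge and its gradient with respect to the raw predictions would look like (plain NumPy, with y in {-1, +1}; not an actual gradient boosting patch):

```python
import numpy as np

def squared_hinge(y, pred):
    # L(y, f) = max(0, 1 - y * f)^2; its derivative is continuous at the kink,
    # which is why the squared / l2 case is usable with gradient methods.
    return np.maximum(0.0, 1.0 - y * pred) ** 2

def squared_hinge_negative_gradient(y, pred):
    # -dL/df = 2 * y * max(0, 1 - y * f); in gradient boosting this is what
    # the next tree would be fit to.
    return 2.0 * y * np.maximum(0.0, 1.0 - y * pred)

y = np.array([1.0, -1.0, 1.0, -1.0])
pred = np.array([0.2, 0.5, 2.0, -3.0])
print(squared_hinge(y, pred))
print(squared_hinge_negative_gradient(y, pred))
```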
I have been going back and forth on this for a while now - do you think equation (10) in the paper is meant to be the differentiated L1 loss? It looks like it, but I don't know how to verify it. When I did this last, I only implemented the L2 version because of their earlier statements about only L2 being differentiable in section 2.2. However, on the same page as eq (10) and (11) in section 2.4, they say that they tested with both L1 and L2 SVM, which would mean they got a gradient for both. I have been too spoiled by Theano's gradient magic...
@larsmans I would not be surprised if after the paper, Y. LeCun was like "oh by the way, nice paper but I did that 20 years ago". Seems to happen a lot - hopefully that means it was a good idea :)
@kastnerkyle Eq. 10 is technically a sub-gradient, not a gradient, so one should use it with the sub-gradient method, not gradient descent. This has implications on convergence proofs, choice of the learning rate, etc. I haven't read the paper but I'd guess it is lacking theoretical guarantees.
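For concreteness, the kink the sub-gradient has to handle (a sketch, with y in {-1, +1}):

```python
import numpy as np

def hinge_subgradient(y, pred):
    # A valid subgradient of max(0, 1 - y * f) with respect to f:
    #   -y   where y * f < 1
    #    0   where y * f > 1
    # At y * f == 1 the loss is not differentiable; any value between
    # -y and 0 is a valid subgradient (0 is chosen here).
    return np.where(y * pred < 1.0, -y, 0.0)

y = np.array([1.0, 1.0, -1.0])
pred = np.array([0.5, 2.0, 0.5])
print(hinge_subgradient(y, pred))  # [-1.  0.  1.]
```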
@mblondel That makes a lot of sense, and explains the confusion I had with the paper. Thanks! Ultimately they say "we tried both, but L2-SVM was always better on our tests" - which may or may not have to do with the difference between gradient/subgradient if they were using eq. 10 in the paper for backprop. Either way, squared_hinge should be quite nice for GBRT I think, thanks for clarifying.
It looks like mdeviance and bdeviance are deprecated, so it might not be a good idea to add aliases for them.
Superseded by #18248. |