A common private module for differentiable loss functions used as objective functions in estimators #15123
I'd be happy to unify everything... But that looks like a hard problem. The first thing that we can reasonably unify are the [...]. If we want to unify more things, the problem gets even harder. E.g. gradient computation: with respect to what, predictions or parameters? Predicted classes or the predicted decision function? etc. And of course, input shape issues... |
I agree that it's not a trivial problem, maybe even more so for GBDT, and I'm not sure that everything can indeed be unified. But say unifying MLP losses with SGD losses should be relatively easy I think; we could start there. As long as everything is private, moving incrementally could be doable. |
Briefly looking at it, MLP defines a half LS loss for numpy arrays in pure Python, while the SGD losses are point-wise Cythonized methods (also a half LS). EDIT: they are also |
Indeed. Well we could have one cdef method for scalar nogil calls and one cpdef for vectorized ones in the same class. That would also help make sure they are consistent. |
I agree this is a bit annoying to have all those implementations in different places, but OTOH writing losses is in general pretty easy and bugs are easily caught. So I wouldn't make it a high priority right now maybe |
Duplication was bearable so far, but personally I would like to extend Tweedie regression to HistGBRT (and probably classical GBRT as well) and to the MLP, and for this loss family we probably need to factorize common private functions as they are significantly more complex than (half) least squares. |
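For context, a reminder of why this family is heavier than (half) least squares (standard formulas, not tied to any particular sklearn implementation): the unit Tweedie deviance for power p outside {0, 1, 2} is

```latex
% Unit Tweedie deviance for power p \notin \{0, 1, 2\} (y \ge 0 where required):
d_p(y, \mu) = 2\left(\frac{y^{2-p}}{(1-p)(2-p)} - \frac{y\,\mu^{1-p}}{1-p} + \frac{\mu^{2-p}}{2-p}\right)
```

with the Poisson (p = 1) and Gamma (p = 2) deviances arising as limiting cases, so the code has to branch on the power and handle the boundary cases carefully.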
Should the link function be tied together with/incorporated into the loss class, or should they be independent? Example, binary log-loss/cross entropy:

- Version 1: the loss is a function of y_true and the raw prediction, i.e. the (inverse) link is incorporated.
- Version 2: the loss is a function of y_true and y_pred, i.e. the link is kept separate.

With version 1, simplifications and more robust implementations are sometimes possible, see [...]. Note that raw_prediction = z.
On top of that, a gradient function in version 1 would automatically involve the derivative of the link function. For optimization this variant might be advantageous, the opposite for validation metrics. |
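A rough numpy sketch of the two variants for the binary log-loss (hypothetical function names, not sklearn code; version 1 takes raw_prediction = z, version 2 takes y_pred = sigmoid(z)):

```python
import numpy as np


def logloss_v1(y_true, raw_prediction):
    """Version 1: loss as a function of raw_prediction = z (link incorporated).

    Uses the identity -y*log(p) - (1-y)*log(1-p) = log(1 + exp(z)) - y*z
    with p = sigmoid(z); logaddexp keeps it numerically stable for large |z|.
    """
    return np.logaddexp(0, raw_prediction) - y_true * raw_prediction


def logloss_v2(y_true, y_pred):
    """Version 2: loss as a function of y_pred = sigmoid(z) (link kept separate)."""
    return -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
```

Both compute the same values, but version 1 never forms y_pred, and its gradient w.r.t. z is simply sigmoid(z) - y_true, i.e. the derivative of the inverse link is already folded in.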
I think it's okay to include the link, or to have link-specific methods, depending on which is the most elegant design that enables those kinds of optimisations. |
If we decide to modularize more than what is currently done in sklearn/ensemble/_hist_gradient_boosting/loss.py, we need careful benchmarking to avoid any performance regression.

For each loss function we could also have a generic implementation that would work for any link function, plus a specialized one that would be optimized for a given link function, and then a private factory that returns the loss object from a (loss name, link function name) pair. This would have the side benefit of making it easy to check that the two agree in readable unit tests, as sketched below.

--
Olivier |
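A minimal sketch of what such a private factory could look like (all names are hypothetical; none of this is existing scikit-learn API):

```python
import numpy as np


class GenericLogLoss:
    """Binary log-loss written in terms of y_pred, composed with an arbitrary inverse link."""

    def __init__(self, inverse_link):
        self.inverse_link = inverse_link

    def __call__(self, y_true, raw_prediction):
        y_pred = self.inverse_link(raw_prediction)
        return np.mean(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))


class LogLossLogitLink:
    """Specialized, numerically more robust variant for the logit link."""

    def __call__(self, y_true, raw_prediction):
        return np.mean(np.logaddexp(0, raw_prediction) - y_true * raw_prediction)


def _get_loss(loss, link):
    """Private factory: return a link-specialized loss if available, else a generic one."""
    specialized = {("log_loss", "logit"): LogLossLogitLink}
    if (loss, link) in specialized:
        return specialized[(loss, link)]()
    inverse_links = {"logit": lambda z: 1.0 / (1.0 + np.exp(-z))}
    return GenericLogLoss(inverse_links[link])
```

A unit test could then draw random y_true in {0, 1} and raw_prediction values and assert that the generic and the specialized variant agree up to numerical tolerance, which is the agreement check mentioned above.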
What do you think about the following design?

1. Link functions

Close to what we already have in [...]:

```python
from abc import ABC, abstractmethod


class BaseLink(ABC):
    """Abstract base class for differentiable, invertible link functions.

    Convention:
        - link function g: raw_prediction = g(y_pred)
        - inverse link h: y_pred = h(raw_prediction)
    """

    @abstractmethod
    def link(self, y_pred, out=None):
        ...

    @abstractmethod
    def derivative(self, y_pred, out=None):
        ...

    @abstractmethod
    def inverse(self, raw_prediction, out=None):
        ...

    @abstractmethod
    def inverse_derivative(self, raw_prediction, out=None):
        ...
```

As an example, [...]

2. Base Loss class

```python
class BaseLoss(BaseLink):
    def __init__(self, hessians_are_constant=False):
        ...

    def __call__(self, y_true, raw_prediction, sample_weight=None):
        """Return the weighted average loss."""

    @abstractmethod
    def pointwise_loss(self, y_true, raw_prediction, sample_weight=None,
                       out=None):
        """Return the loss value for each input."""

    @abstractmethod
    def fit_intercept_only(self, y_true, sample_weight=None):
        """Return the raw_prediction of an intercept-only model."""

    def gradient(self, y_true, raw_prediction, sample_weight=None, out=None):
        """Calculate the gradient of the loss w.r.t. raw_prediction."""
        y_pred = self.inverse(raw_prediction)
        out = self.gradient_ypred(y_true=y_true, y_pred=y_pred,
                                  sample_weight=sample_weight, out=out)
        return out

    @abstractmethod
    def gradient_ypred(self, y_true, y_pred, sample_weight=None, out=None):
        """Calculate the gradient of the loss w.r.t. raw_prediction,
        expressed in terms of y_pred."""

    def gradient_hessian(self, y_true, raw_prediction, sample_weight=None,
                         gradient=None, hessian=None):
        """Calculate the gradient and hessian of the loss w.r.t. raw_prediction."""
        ...

    @abstractmethod
    def gradient_hessian_ypred(self, y_true, y_pred, sample_weight=None,
                               gradient=None, hessian=None):
        """Calculate the gradient and hessian of the loss w.r.t. raw_prediction,
        expressed in terms of y_pred."""
```

3. Specific loss classes

```python
class BinaryCrossEntropy(LogitLink, BaseLoss):
    ...
```

Notes

What I like here is that the hessian (for single-target estimation) would be a 1d array which represents the diagonal of the (2d) hessian matrix. |
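To make the proposal concrete, a rough pure-Python sketch of a half squared error with identity link under this design (method names follow the proposal above; the abstract base classes are omitted to keep it self-contained, and nothing here is actual scikit-learn code):

```python
import numpy as np


class IdentityLink:
    """Identity link: raw_prediction = y_pred."""

    def link(self, y_pred, out=None):
        return y_pred

    def derivative(self, y_pred, out=None):
        return np.ones_like(y_pred)

    def inverse(self, raw_prediction, out=None):
        return raw_prediction

    def inverse_derivative(self, raw_prediction, out=None):
        return np.ones_like(raw_prediction)


class HalfSquaredError(IdentityLink):
    """Half squared error, 0.5 * (y_true - raw_prediction)**2, with identity link."""

    def pointwise_loss(self, y_true, raw_prediction, sample_weight=None, out=None):
        loss = 0.5 * (y_true - raw_prediction) ** 2
        return loss if sample_weight is None else sample_weight * loss

    def gradient_ypred(self, y_true, y_pred, sample_weight=None, out=None):
        grad = y_pred - y_true
        return grad if sample_weight is None else sample_weight * grad

    def fit_intercept_only(self, y_true, sample_weight=None):
        # Raw prediction of an intercept-only model is the (weighted) mean.
        return np.average(y_true, weights=sample_weight)
```

With this, BaseLoss.gradient would call self.inverse (the identity here) and then gradient_ypred, yielding raw_prediction - y_true as expected.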
The more I think about the design of such a loss function module, the more I think one should distinguish between loss functions used for fitting/estimation and loss functions used for validation/comparison (as in the metrics module). Example: the Poisson deviance has its minimum at 0. This is nice for interpretation and allows one to define an analog of R^2. But for fitting it is possible to omit some "constant" terms, with the downside of not knowing the optimal point anymore. |
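To illustrate with the Poisson case (standard formulas, not code from any sklearn module): the full deviance and a fitting objective differ only by terms that do not depend on the prediction.

```latex
% Full Poisson deviance: non-negative, equal to 0 at \mu = y.
D(y, \mu) = 2\left(y \log\frac{y}{\mu} - y + \mu\right)
% Fitting objective with terms constant in \mu dropped:
\ell(y, \mu) = \mu - y \log \mu
% Same minimizer \mu = y, but the minimum value is no longer 0 (it depends on y).
```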
Except that we do print the loss occasionally, e.g. to monitor convergence, and if it doesn't match the documented loss it can lead to confusion. |
@rth Do you have a suggestion how to tackle this? Would it be enough to document the discrepancy?
|
The main challenge, I think, is to define a common API. Some estimators expect a single-number output, others arrays. Some would like to use parallelized versions, others have their own parallelism, ... |
Among my main motivations are:
|
Yes, I think as long as the loss that is used is documented it should be fine. It's a challenging project indeed :) |
So let's play a little IT architect. Current use of loss functions (for single targets):

1. HistGradientBoosting: files in sklearn.ensemble._hist_gradient_boosting: loss.py, _loss.pyx
2. GradientBoosting: file sklearn/ensemble/_gb_losses.py
3. SGD: file sklearn/linear_model/_sgd_fast.pyx
4. GLM: to make it short, gradients, hessians and a link function as for HistGradientBoosting are sufficient.
5. MLP: file sklearn/neural_network/_base.py
6. Trees

This post might be updated in the future. |
My plan would be to have one working design for HistGradientBoosting and GLM. |
Currently, I'm blocked by cython/cython#3607. My idea was for each loss class to implement only 3 single Cython functions (loss, gradient, hessian) for scalar values (single samples) and then use virtual inheritance from the parent class to deliver these functions working on arrays. Something like:

Edit: Changed [...]

```
cimport cython
from cython.parallel import prange

# fused type for target variables
ctypedef fused Y_DTYPE:
    cython.float
    cython.double

# fused type for gradients and hessians
ctypedef fused G_DTYPE:
    cython.float
    cython.double


cdef class cLossFunction:
    ...
    cdef double cgradient(self, double y_true, double raw_prediction) nogil:
        return 0

    def gradient(self,
                 Y_DTYPE[:] y_true,          # IN
                 Y_DTYPE[:] raw_prediction,  # IN
                 Y_DTYPE[:] sample_weight,   # IN
                 G_DTYPE[:] out,             # OUT
                 ):
        cdef:
            int n_samples
            int i
        n_samples = raw_prediction.shape[0]
        for i in prange(n_samples, schedule='static', nogil=True):
            out[i] = sample_weight[i] * self.cgradient(y_true[i], raw_prediction[i])


cdef class cHalfSquaredError(cLossFunction):
    """Half Squared Error with identity link."""
    ...
    cdef double cgradient(self, double y_true, double raw_prediction) nogil:
        return raw_prediction - y_true
```

The point is that HistGradientBoosting is longing for extension types as it uses different types for the targets and for the gradients/hessians. |
As long as cython/cython#3607 is unresolved, I would be interested in opinions on the following 3 workarounds:
|
1 means we're not backward compatible, right? For 2 and 3, unfortunately I don't really see how these options will effectively help reducing the complexity or redundancy of the losses code, which I believe was the original motivation here |
News: cython/cython#3607 doesn't seem to be a blocker anymore, as long as the cdef functions like [...]

Design reasoning: [...]
This sounds a bit like the function dispatch mechanism for generating numpy ufuncs, except the [...] |
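For reference, a tiny illustration of that numpy mechanism, just to make the analogy concrete (it is not a viable option here, since np.frompyfunc produces object-dtype ufuncs with per-element Python-call overhead):

```python
import numpy as np


def half_squared_error(y_true, raw_prediction):
    """Scalar, point-wise loss."""
    return 0.5 * (y_true - raw_prediction) ** 2


# Turn the scalar function into a (slow, object-dtype) ufunc that broadcasts over arrays.
loss_ufunc = np.frompyfunc(half_squared_error, 2, 1)
values = loss_ufunc(np.array([1.0, 2.0]), np.array([1.5, 1.0])).astype(float)
```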
Updated 24.11.2020

Time ratio of log-loss implementations (compared to numpy): [benchmark table] |
Another question of mine is how important is parallelism, i.e. |
Assumptions: [...]

With the above assumptions and the benchmarks, I propose the following sketched design:

```
import numpy as np

ctypedef double (*fpointer)(double, double) nogil


cdef void generic_loop(fpointer f, double[::1] y_true, double[::1] raw,
                       double[::1] out):
    cdef int i
    cdef int n_samples = y_true.shape[0]
    with nogil:
        for i in range(n_samples):  # could be prange
            out[i] = f(y_true[i], raw[i])


cdef double c_logloss(double y_true, double raw) nogil:
    """Defines the log-loss for a single point."""
    # ... point-wise formula elided in this sketch
    return 0


cdef class Logloss():
    cdef double loss_point(self, double y_true, double raw) nogil:
        return c_logloss(y_true, raw)

    def loss(self, double[::1] y_true, double[::1] raw):
        cdef double[::1] out = np.empty_like(y_true)
        generic_loop(c_logloss, y_true, raw, out)
        return np.asarray(out)
```

Alternatives: [...]
Just for fun, and unfortunately unresolved: have a look at this old mail thread, [Cython] Wacky idea: proper macros. |
Thanks for doing this analysis @lorentzenchr !
If we don't have this constraint, and just copy the pointwise implementation as needed with a range or a prange, would that simplify things? Or is the virtual inheritance still an issue?
The proposed design sounds reasonable to me but I'm still a bit surprised by the slowdown we see in the benchmarks above with respect to the plain numpy implementation. Do you have an idea what this could be due to?
In that example, if np_logloss is faster couldn't we just use it for the vectorized version? Or is the issue that we need to release GIL? |
Maybe I misunderstand you. My goal is to avoid implementing the range/prange loop every single time, as it would be repeated #losses * #functions > 10 times at least (much depending on the number of losses; #functions ~ 3).

The plain numpy version gets slower with a numerically more stable solution. In the benchmark, it is [...]

For thread parallelism via [...] |
As a default approach that sounds reasonable. I'm just saying that in some cases writing that code twice without factorizing it into a function might also help some optimizations in the for loops: either because of auto-vectorization (with the use of SIMD intrinsics) by the compiler when pre-computing some quantities, or because it's more cache friendly (e.g. due to use of BLAS). For instance with the hinge loss implemented for arrays, it's probably enough to allocate an array of zeros and then set all the elements above the threshold; there is no need to set to 0 the elements that are already zero, which would be done in the factorized function (see the sketch below). There OpenMP would also likely not help, as it's a very simple and memory-bound operation. In general I'm a bit wary of prange used everywhere, unless one can demonstrate that it's indeed a performance bottleneck (I'm not too familiar with the HGBDT code, maybe it is there), as on multicore systems (16+ CPU cores) it leads to a lot of wasted resources and sometimes slower performance (CPU oversubscription etc.).

In the loop indeed, but the overall method/function doesn't, as far as I can tell (e.g. [...]). Overall I would be +1 to start with the approach you proposed with a factorized function, but without OpenMP, and then tune performance if needed on a case-by-case basis. |
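A small numpy sketch of the vectorized hinge-loss idea above (illustrative only; hypothetical function name, labels assumed in {-1, +1}):

```python
import numpy as np


def hinge_loss_values(y_true, raw_prediction):
    """Point-wise hinge loss max(0, 1 - y*f) for y in {-1, +1}.

    Only the samples inside the margin are written; the rest of the
    pre-allocated zeros array is left untouched.
    """
    out = np.zeros_like(raw_prediction)
    margin = y_true * raw_prediction
    inside = margin < 1.0
    out[inside] = 1.0 - margin[inside]
    return out
```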
I updated my notebook with a numerically more stable implementation of the log loss in Chapter 2. In this case our own implementation wins over pure numpy. |
How to best get such a deep code change in? I propose one PR for the new _loss module alone, then one follow-up PR per module. Difficulties with this approach:
Big advantage is that this way is more digestible than one huge PR. |
In a prototype with a generic function wrapper to carry out the loop, I got consistently worse performance than with a manual loop. This is consistent with the findings of the above-linked notebook for larger sample sizes. Example: BinaryCrossEntropy [...]. This brings me back to manual loops, resulting in a lot of code duplication. But a PR with worse performance won't be accepted anyway, I think. @rth About [...] |
Noticeably worse performance on the estimator training, probably not; but the question is how much of the estimator fit time is spent computing the loss and gradients. The small slowdown on the loss calculation might not matter much, for instance.
Are you sure n_threads=1 wouldn't create a thread pool in Cython and would be equivalent in performance to a serial function?
Sounds good. The first PR doesn't have to be exhaustive either; if it takes into account two types of estimators that would already be great. All of this is (or should be) private, so several rounds of smallish iterations might indeed be faster to merge. |
We now have #20567 merged. Let's close this issue when some modules make use of it, e.g. (histogram) gradient boosting and linear models. |
Meanwhile, the new loss module is used for HGBT #20811 and in LogisticRegression #21808. Open are:
|
Currently each model that needs it defines its own losses; it might be useful to put them all in one place to see if anything could be re-used in the future.
In particular, losses are defined in:
sklearn/ensemble/_hist_gradient_boosting/loss.py
sklearn/ensemble/_gb_losses.py
sklearn/linear_model/sgd_fast.pyx
sklearn/linear_model/_glm/distributions.py
sklearn/neural_network/_base.py:LOSS_FUNCTIONS
sklearn.metrics
sklearn/tree/_criterion.pyx
This issue proposes to progressively put most of them (except for the RF criteria) in a single private folder, sklearn/_loss, to see what could be made less redundant there. In particular, for the GLM PR in #1430 this would allow breaking the circular dependency between the sklearn.linear_model and sklearn.metrics modules (#14300 (comment)).

@NicolasHug I don't know if you already had some plans about unifying the new and legacy gradient boosting losses and if this would go in the right direction. cc @ogrisel
Weakly related to #5044 #3481 #5368