FIX more precise log loss gradient and hessian #28048

lorentzenchr · 2024-01-02T16:21:07Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR improves gradient and hessian of HalfBinomialLoss thereby preventing overflow of exp(large number) resulting in inf/nan return values.

The implemented change is very carefully designed and tested for minimal to no runtime/performance penalty.

Any other comments?

github-actions · 2024-01-02T16:22:23Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 0293fad. Link to the linter CI: here}

lorentzenchr · 2024-01-02T16:35:27Z

Benchmark for `gradient` on arrays of length 100_000:

Gradient version	timing with large values	timing values in [-10, 10]
sklearn v1.3 (version 1)	723 µs ± 54.9 µs	691 µs ± 71.8 µs
stable see [1] (version 2)	1.02 ms ± 116 µs	673 µs ± 54.3 µs
this PR (version 3)	740 µs ± 81.5 µs	658 µs ± 17.2 µs

[1] https://fa.bianp.net/blog/2019/evaluate_logistic/

%load_ext cython

import numpy as np

# 1. numpy ufunc version, the stable version 1
#   Problem: Returns NaN for large negative values of raw.
def np_gradient_stable1(y_true, raw):
    exp_tmp = np.exp(-raw)
    return ((1 - y_true) - y_true * exp_tmp) / (1 + exp_tmp)


# 2. numpy ufunc version, the more stable version 2
# See https://fa.bianp.net/blog/2019/evaluate_logistic/
def np_gradient_stable2(y_true, raw):
    """Compute expit(x) - b component-wise."""
    out = np.empty_like(raw)
    idx = raw < 0
    exp_r = np.exp(raw[idx])
    y_idx = y_true[idx]
    out[idx] = ((1 - y_idx) * exp_r - y_idx) / (1 + exp_r)
    exp_nr = np.exp(-raw[~idx])
    y_nidx = y_true[~idx]
    out[~idx] = ((1 - y_nidx) - y_nidx * exp_nr) / (1 + exp_nr)
    return out


%%cython -3
# distutils: extra_compile_args = -O3
# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False

import cython
import numpy as np

from libc.math cimport exp
cimport numpy as np


np.import_array()

    
cdef inline double c_gradient1(double y_true, double raw) nogil:
    cdef double exp_tmp = exp(-raw)
    return ((1 - y_true) - y_true * exp_tmp) / (1 + exp_tmp)


cdef inline double c_gradient2(double y_true, double raw) nogil:
    cdef double exp_tmp
    if raw < 0:
        exp_tmp = exp(raw)
        return ((1 - y_true) * exp_tmp - y_true) / (1 + exp_tmp)
    else:
        exp_tmp = exp(-raw)
        return ((1 - y_true) - y_true * exp_tmp) / (1 + exp_tmp)


cdef inline double c_gradient3(double y_true, double raw) nogil:
    cdef double exp_tmp
    # Help branch prediction.
    # Note that scipy.special.logit(np.finfo(float).eps) ~ -36.04365
    if raw > -37:
        exp_tmp = exp(-raw)
        return ((1 - y_true) - y_true * exp_tmp) / (1 + exp_tmp)
    else:
        # expit(raw) = exp(raw) for raw < -37
        return exp(raw) - y_true


### 2. Cython function: loop over ndarray by calling C level functions
def cy_gradient_stable1(double[::1] y_true, double[::1] raw):
    cdef:
        int n_samples
        int i
        cdef double[::1] out = np.empty_like(y_true)
    
    n_samples = y_true.shape[0]
    for i in range(n_samples):
        out[i] = c_gradient1(y_true[i], raw[i])
        
    return out


def cy_gradient_stable2(double[::1] y_true, double[::1] raw):
    cdef:
        int n_samples
        int i
        cdef double[::1] out = np.empty_like(y_true)
    
    n_samples = y_true.shape[0]
    for i in range(n_samples):
        out[i] = c_gradient2(y_true[i], raw[i])
        
    return out


def cy_gradient_stable3(double[::1] y_true, double[::1] raw):
    cdef:
        int n_samples
        int i
        cdef double[::1] out = np.empty_like(y_true)
    
    n_samples = y_true.shape[0]
    for i in range(n_samples):
        out[i] = c_gradient3(y_true[i], raw[i])
        
    return out


rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.5, size=100_000).astype(np.float64)
raw = 20 * rng.standard_normal(100_000, dtype=np.float64)  # make sure some values are <= -37 and > 33
print(f"min and max raw = {np.min(raw)}, {np.max(raw)}")
# min and max raw = -91.87980390791544, 85.34683410460212

np.allclose(np_gradient_stable1(y_true, raw), np_gradient_stable2(y_true, raw)), \
np.allclose(np_gradient_stable1(y_true, raw), cy_gradient_stable1(y_true, raw)), \
np.allclose(np_gradient_stable1(y_true, raw), cy_gradient_stable2(y_true, raw)), \
np.allclose(np_gradient_stable1(y_true, raw), cy_gradient_stable3(y_true, raw))
# True

%timeit -r20 np_gradient_stable1(y_true, raw)
# 666 µs ± 20.3 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

%timeit -r20 np_gradient_stable2(y_true, raw)
3.88 ms ± 223 µs per loop (mean ± std. dev. of 20 runs, 100 loops each)

%timeit -r20 cy_gradient_stable1(y_true, raw)
# 723 µs ± 54.9 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

%timeit -r20 cy_gradient_stable2(y_true, raw)
# 1.02 ms ± 116 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

%timeit -r20 cy_gradient_stable3(y_true, raw)
740 µs ± 81.5 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)


# Same for smaller values of raw
raw2 = np.linspace(-10, 10, 100_000)

%timeit -r20 cy_gradient_stable1(y_true, raw2)
# 691 µs ± 71.8 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

%timeit -r20 cy_gradient_stable2(y_true, raw2)
# 673 µs ± 54.3 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

%timeit -r20 cy_gradient_stable3(y_true, raw2)
# 658 µs ± 17.2 µs per loop (mean ± std. dev. of 20 runs, 1,000 loops each)

Precision

"version 1" is the actual implementation, "version 3" this PR.
Observe the red outliers in the top right corner of version 1!

import mpmath as mp


# Stolen from scipy
def mpf2float(x):
    """
    Convert an mpf to the nearest floating point number. Just using
    float directly doesn't work because of results like this:

    with mp.workdps(50):
        float(mpf("0.99999999999999999")) = 0.9999999999999999

    """
    return float(mp.nstr(x, 17, min_fixed=0, max_fixed=0))


def mp_gradient(y_true, raw, dps=50):
    y_true, raw = np.asarray(y_true), np.asarray(raw)
    out = np.empty_like(y_true)
    with mp.workdps(dps):
        for i in range(len(y_true)):
            y, r = mp.mpf(float(y_true[i])), mp.mpf(float(raw[i]))
            res = mp.mpf(1) / (mp.mpf(1) + mp.exp(-r)) - y
            out[i] = mpf2float(res)
    return out


def rel_accuracy(test, reference):
    result = np.abs((test - reference) / np.maximum(1, reference))
    result[np.isnan(result)] = 10
    return result


import matplotlib.pyplot as plt


def scatter_with_outlier(ax, x, y, threshold=1e-5, **kwargs):
    ax.scatter(x, y, **kwargs)
    mask = y >= threshold
    ax.scatter(x[mask], y[mask], color="red", **kwargs)

raw3 = np.sinh(np.linspace(np.arcsinh(-1000), np.arcsinh(1000), 20_001))
exact = mp_gradient(np.ones_like(raw3), raw3)
result1 = np.asarray(cy_gradient_stable1(np.ones_like(raw3), raw3))
result2 = np.asarray(cy_gradient_stable2(np.ones_like(raw3), raw3))
result3 = np.asarray(cy_gradient_stable3(np.ones_like(raw3), raw3))


fig, axes = plt.subplots(ncols=3, figsize=(12, 4), sharey=True)
scatter_with_outlier(axes[0], raw3, rel_accuracy(result1, exact), s=1, label="version 1")
scatter_with_outlier(axes[1], raw3, rel_accuracy(result2, exact), s=1, label="version 2")
scatter_with_outlier(axes[2], raw3, rel_accuracy(result3, exact), s=1, label="version 3")
axes[0].set_ylabel('relative precision')
for i in range(len(axes)):
    axes[i].set_xlabel('raw_prediction')
    axes[i].set_xscale('symlog')
    axes[i].set_yscale('symlog', linthresh=1e-30)
    axes[i].set_title(f"version {i+1}")
fig.suptitle("Precision with y_true=1")

Details for mpmath implementation:

def mp_logloss(y_true, raw):
    with mp.workdps(100):
        y_true, raw = mp.mpf(float(y_true)), mp.mpf(float(raw))
        out = mp.log1p(mp.exp(raw)) - y_true * raw
    return mpf2float(out)


def mp_gradient(y_true, raw):
    with mp.workdps(100):
        y_true, raw = mp.mpf(float(y_true)), mp.mpf(float(raw))
        out = mp.mpf(1) / (mp.mpf(1) + mp.exp(-raw)) - y_true
    return mpf2float(out)

def mp_hessian(y_true, raw):
    with mp.workdps(100):
        y_true, raw = mp.mpf(float(y_true)), mp.mpf(float(raw))
        p = mp.mpf(1) / (mp.mpf(1) + mp.exp(-raw))
        out = p * (mp.mpf(1) - p)
    return mpf2float(out)


y, raw = 0.0, 37.
mp_logloss(y, raw), mp_gradient(y, raw), mp_hessian(y, raw)

lorentzenchr · 2024-01-05T09:26:52Z

@jjerphan gentle ping for a review. The actual change is quite small, comments and tests are the majority of the diff.

jjerphan

LGTM. Thank you, @lorentzenchr.

Side-comment: I wonder if we could use mpmath optionally in tests ; see one of the comments in context.

sklearn/_loss/tests/test_loss.py

sklearn/_loss/_loss.pyx.tp

sklearn/_loss/tests/test_loss.py

sklearn/_loss/_loss.pyx.tp

lorentzenchr · 2024-01-08T22:38:09Z

@glemaitre @lesteve While working on #28063, I played a little bit around in https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py and got some errors pretty soon.
Those errors are fixed by this PR. So I would advocate to bring it in v1.4.1 as bugfix.

I'm sorry to have introduced this bug with the new loss functions for the old Gradient Boosting. There are, however, no tests for it in the old GB tests!!!

glemaitre · 2024-01-09T15:27:47Z

We can introduce in 1.4.0 still. This is a regression that we got before releasing (that is nice). To be certain, it would still be good to have the entry in the changelog because the bug could occur in the HistGradientBoosting prior to 1.4?

ogrisel

I played a little bit around in https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py and got some errors pretty soon.

Out of curiosity, what kind of hyperparameter combinations lead to large raw_prediction values leading to numerical stability problems during training?

Would it make sense to include such a case as a public API level non-regression tests?

Otherwise I am fine with the private-level loss module non-regression tests only. They are quite extensive and look good to me.

Please let move the changelog entry to 1.4.0 prior to merging this PR though.

ogrisel · 2024-01-09T15:22:09Z

doc/whats_new/v1.5.rst

+  :class:`ensemble.HistGradientBoostingClassifier` and
+  :class:`linear_model.LogisticRegression`.
+  :pr:`28048` by :user:`Christian Lorentzen <lorentzenchr>`.
+


I agree this fix should be included in 1.4.0. Please move the changelog entry accordingly.

This reverts commit a3b94c5.

lorentzenchr · 2024-01-09T17:42:02Z

Out of curiosity, what kind of hyperparameter combinations lead to large raw_prediction values leading to numerical stability problems during training?

I changed "min_samples_split": 5, to "min_samples_leaf": 2,.

Would it make sense to include such a case as a public API level non-regression tests?

I think the test of the loss functions cover that very thoroughly. To be sure, I added 0293fad.

Please let move the changelog entry to 1.4.0 prior to merging this PR though.

Done. But have a look at the place. It effects different classes from very different modules.

ogrisel · 2024-01-10T09:01:09Z

Thanks for the fix. @glemaitre @jeremiedbb this will require a backport to 1.4.X.

lorentzenchr added 2 commits January 2, 2024 16:29

TST add high precision log loss test cases

b237d24

ENH more precise gradient and hessian for CyHalfBinomialLoss

ba58f6f

github-actions bot added the cython label Jan 2, 2024

lorentzenchr added Performance Numerical Stability labels Jan 2, 2024

jjerphan self-requested a review January 5, 2024 09:38

jjerphan mentioned this pull request Jan 6, 2024

BUG: special: Loss of precision in logsumexp scipy/scipy#18295

Closed

jjerphan approved these changes Jan 6, 2024

View reviewed changes

lorentzenchr added 3 commits January 7, 2024 13:59

CLN simplify test_loss_on_specific_values

7a884dc

DOC add whatsnew

a3b94c5

TST add mpmath code as comment

5e6c694

lorentzenchr changed the title ~~ENH more precise log loss gradient and hessian~~ FIX more precise log loss gradient and hessian Jan 8, 2024

lorentzenchr added the Bug label Jan 9, 2024

lorentzenchr added this to the 1.4 milestone Jan 9, 2024

ogrisel approved these changes Jan 9, 2024

View reviewed changes

lorentzenchr added 2 commits January 9, 2024 18:37

Revert "DOC add whatsnew"

bcd10b5

This reverts commit a3b94c5.

DOC whatsnew 1.4

59f7c0b

TST add raw_prediction=+-1e20

0293fad

jjerphan merged commit 5ad8e45 into scikit-learn:main Jan 9, 2024

lorentzenchr deleted the logloss_gradient branch January 9, 2024 19:30

glemaitre added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jan 10, 2024

lorentzenchr mentioned this pull request Jan 10, 2024

FIX divide by zero in line search of GradientBoostingClassifier #28095

Merged

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jan 17, 2024

FIX more precise log loss gradient and hessian (scikit-learn#28048)

52d93e9

jeremiedbb pushed a commit that referenced this pull request Jan 17, 2024

FIX more precise log loss gradient and hessian (#28048)

9226ef4

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024

FIX more precise log loss gradient and hessian (scikit-learn#28048)

3b240a6

This was referenced Feb 18, 2024

ENH more stable gradient of CrossEntropy microsoft/LightGBM#6327

Merged

ENH minor improvement of binomial hessian #28453

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX more precise log loss gradient and hessian #28048

FIX more precise log loss gradient and hessian #28048

lorentzenchr commented Jan 2, 2024

github-actions bot commented Jan 2, 2024 •

edited

Loading

lorentzenchr commented Jan 2, 2024 •

edited

Loading

lorentzenchr commented Jan 5, 2024

jjerphan left a comment

lorentzenchr commented Jan 8, 2024 •

edited

Loading

glemaitre commented Jan 9, 2024

ogrisel left a comment •

edited

Loading

ogrisel Jan 9, 2024

lorentzenchr commented Jan 9, 2024 •

edited

Loading

ogrisel commented Jan 10, 2024

FIX more precise log loss gradient and hessian #28048

FIX more precise log loss gradient and hessian #28048

Conversation

lorentzenchr commented Jan 2, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

github-actions bot commented Jan 2, 2024 • edited Loading

✔️ Linting Passed

lorentzenchr commented Jan 2, 2024 • edited Loading

Benchmark for gradient on arrays of length 100_000:

Precision

lorentzenchr commented Jan 5, 2024

jjerphan left a comment

Choose a reason for hiding this comment

lorentzenchr commented Jan 8, 2024 • edited Loading

glemaitre commented Jan 9, 2024

ogrisel left a comment • edited Loading

Choose a reason for hiding this comment

ogrisel Jan 9, 2024

Choose a reason for hiding this comment

lorentzenchr commented Jan 9, 2024 • edited Loading

ogrisel commented Jan 10, 2024

github-actions bot commented Jan 2, 2024 •

edited

Loading

lorentzenchr commented Jan 2, 2024 •

edited

Loading

Benchmark for `gradient` on arrays of length 100_000:

lorentzenchr commented Jan 8, 2024 •

edited

Loading

ogrisel left a comment •

edited

Loading

lorentzenchr commented Jan 9, 2024 •

edited

Loading