MNT Avoid catastrophic cancellation in mean_variance_axis #19766

Merged

Conversation

@jeremiedbb (Member) commented Mar 25, 2021

Fixes #19546

Fixes the unexpected loss of precision in the variance when the input is sparse, weights are provided, and the variance should actually be 0, as described in #19450 (comment).

With this PR the result is 0, as expected:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.sparsefuncs import mean_variance_axis

n_samples = 100
sw = np.random.rand(n_samples)  # random sample weights

# Two constant columns (all zeros, all ones): both variances must be 0.
X = np.zeros(shape=(n_samples, 2))
X[:, 1] = 1.

mean_variance_axis(csr_matrix(X), axis=0, weights=sw)[1]
# array([0., 0.])
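
For comparison, a dense NumPy computation of the same weighted variance (an illustrative sketch added here, not part of the PR; it reuses X and sw from the snippet above) returns exact zeros, because the deviations from the column means are exactly 0:

# Dense reference using numpy.average (illustrative, not scikit-learn code).
mean = np.average(X, axis=0, weights=sw)
var = np.average((X - mean) ** 2, axis=0, weights=sw)
var
# array([0., 0.])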

@ogrisel (Member) commented Mar 26, 2021

Possibly related to #19546?

@jeremiedbb jeremiedbb changed the title MNT Improve precision of variance on sparse input with weights [WIP] MNT Improve precision of variance on sparse input with weights Mar 26, 2021
@jeremiedbb (Member, Author) replied:

> Possibly related to #19546?

Absolutely. It aims to solve the same issue. However, it turns out that it's still not as precise as the dense case. I'm digging further :)

@jeremiedbb jeremiedbb changed the title [WIP] MNT Improve precision of variance on sparse input with weights MNT Avoid catastrophic cancellation in mean_variance_axis Mar 29, 2021
@jeremiedbb (Member, Author) commented:
I reworked the PR to focus on the catastrophic cancellation described in #19546, caused by variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2. The precision is now comparable to that of the dense case. Below are the results of the gist linked in #19546 with various RNG seeds; a short sketch of the cancellation pattern follows the results.

## dtype=float64
_incremental_mean_and_var [100.] [0.]
csr_mean_variance_axis0 [100.] [7.2701421e-27]
incr_mean_variance_axis0 csr [100.] [7.2701421e-27]
csc_mean_variance_axis0 [100.] [7.2701421e-27]
incr_mean_variance_axis0 csc [100.] [7.2701421e-27]
## dtype=float32
_incremental_mean_and_var [100.00000577] [3.32692735e-11]
csr_mean_variance_axis0 [99.99997] [9.3132246e-10]
incr_mean_variance_axis0 csr [99.99997] [9.3132246e-10]
csc_mean_variance_axis0 [99.99997] [9.3132246e-10]
incr_mean_variance_axis0 csc [99.99997] [9.3132246e-10]
## dtype=float64
_incremental_mean_and_var [100.] [0.]
csr_mean_variance_axis0 [100.] [0.]
incr_mean_variance_axis0 csr [100.] [0.]
csc_mean_variance_axis0 [100.] [0.]
incr_mean_variance_axis0 csc [100.] [0.]
## dtype=float32
_incremental_mean_and_var [99.99999932] [4.66211111e-13]
csr_mean_variance_axis0 [99.99993] [4.7148196e-09]
incr_mean_variance_axis0 csr [99.99993] [4.7148196e-09]
csc_mean_variance_axis0 [99.99993] [4.7148196e-09]
incr_mean_variance_axis0 csc [99.99993] [4.7148196e-09]
## dtype=float64
_incremental_mean_and_var [100.] [2.01948392e-28]
csr_mean_variance_axis0 [100.] [1.81753553e-27]
incr_mean_variance_axis0 csr [100.] [1.81753553e-27]
csc_mean_variance_axis0 [100.] [1.81753553e-27]
incr_mean_variance_axis0 csc [100.] [1.81753553e-27]
## dtype=float32
_incremental_mean_and_var [99.99999692] [9.51546741e-12]
csr_mean_variance_axis0 [100.] [0.]
incr_mean_variance_axis0 csr [100.] [0.]
csc_mean_variance_axis0 [100.] [0.]
incr_mean_variance_axis0 csc [100.] [0.]
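
To make the failure mode concrete, here is a minimal NumPy sketch on synthetic data (an illustration of the generic cancellation pattern, not the PR's actual Cython code) showing why a "sum of squares minus squared mean" formulation loses nearly all significant digits when the variance is tiny relative to the mean, while a shifted two-pass formulation stays stable:

import numpy as np

rng = np.random.default_rng(0)
# Mean ~100, true variance ~1e-14: the regime exercised by the gist above.
x = 100.0 + rng.normal(scale=1e-7, size=100)

# Naive form: two nearly equal quantities of magnitude ~1e4 are subtracted,
# destroying almost all significant digits (the result can even be negative).
naive = (x ** 2).mean() - x.mean() ** 2

# Shifted two-pass form: deviations are computed first, so nothing cancels.
shifted = ((x - x.mean()) ** 2).mean()

The actual fix lives in the Cython routines touched by the diff below; the sketch only shows the pattern being avoided.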

@ogrisel (Member) left a comment:

Very nice fix @jeremiedbb! I assume this code has changed too much in main to consider a backport for 0.24.2, but this is fine with me.

@@ -131,13 +131,19 @@ def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
np.ndarray[floating, ndim=1] sum_weights_nz = \
np.zeros(shape=n_features, dtype=dtype)

np.ndarray[np.uint64_t, ndim=1] counts = np.full(
@thomasjpfan (Member) commented:

A more explicit name:

Suggested change
np.ndarray[np.uint64_t, ndim=1] counts = np.full(
np.ndarray[np.uint64_t, ndim=1] counts_nan = np.full(

@jeremiedbb (Member, Author) replied:

sum_weights is the sum of all weights where X is not NaN
sum_weights_nan is the sum of weights where X is NaN
sum_weights_nz is the sum of weights where X is non-zero

Following the same scheme:
counts is the number of elements which are not NaN
counts_nz is the number of elements which are non-zero

I'd rather keep that. Maybe you missed that the increment is negative (counts[col_ind] -= 1).
Let me try to reorder the code so that the correspondence is clearer (and remove sum_weights_nan; we actually only need 2 out of the 3 arrays).
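
For readers following along, a hedged Python sketch of the bookkeeping described above (the real code is Cython; the helper below is hypothetical and assumes no explicitly stored zeros):

import numpy as np

def count_arrays(X_data, X_indices, n_samples, n_features):
    # counts starts at n_samples per feature and is decremented for every
    # stored NaN, so it ends up as the number of non-NaN elements
    # (implicit zeros are never NaN, hence the full initialization).
    counts = np.full(n_features, n_samples, dtype=np.uint64)
    # counts_nz counts the stored (non-zero), non-NaN values.
    counts_nz = np.zeros(n_features, dtype=np.uint64)
    for val, col_ind in zip(X_data, X_indices):
        if np.isnan(val):
            counts[col_ind] -= 1  # the negative increment noted above
        else:
            counts_nz[col_ind] += 1
    return counts, counts_nz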

@thomasjpfan (Member) commented Mar 30, 2021:

Yes, I was mistaken with the suggestion. I was thinking about its negation, so the suggestion should have been counts_non_nan.

If we do not change the name, I think we should at least put a comment above counts saying that it is the number of elements which are not NaN.

@jeremiedbb (Member, Author) replied:

I added comments to describe the different arrays.

@ogrisel (Member) commented Mar 30, 2021

@thomasjpfan merge?

@thomasjpfan (Member) left a comment:

LGTM!

Maybe @ogrisel should look this over one more time, because removing sum_weights_nan is a semi-significant change since his last approval.

@ogrisel (Member) left a comment:

I gave it another look and LGTM. Let's merge!

@ogrisel ogrisel merged commit 57d3668 into scikit-learn:main Mar 31, 2021
@glemaitre mentioned this pull request Apr 22, 2021
Development

Successfully merging this pull request may close these issues:
Weighted variance computation for sparse data is not numerically stable