Description
This issue was discovered when adding tests for #19527 (currently marked XFAIL).
Here is a minimal reproduction case using the underlying private API:
https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267
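Roughly, the gist compares the dense helper against the sparse Cython helpers on data with a large mean (~100) and a tiny variance, which is the regime where catastrophic cancellation shows up. A simplified sketch of that comparison (not the exact script from the gist; it goes through the public `mean_variance_axis` wrapper instead of calling the Cython functions directly, and the data is made up) could look like:

```python
# Simplified sketch of the comparison in the gist (illustrative only):
# data with a large mean (~100) and a tiny variance, compared through the
# dense incremental helper and the public wrapper around the sparse
# Cython helpers.
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import _incremental_mean_and_var
from sklearn.utils.sparsefuncs import mean_variance_axis

rng = np.random.default_rng(42)
for dtype in (np.float64, np.float32):
    X = (100 + 1e-5 * rng.standard_normal((10_000, 1))).astype(dtype)

    mean, var, _ = _incremental_mean_and_var(
        X,
        last_mean=np.zeros(1),
        last_variance=np.zeros(1),
        last_sample_count=np.zeros(1),
    )
    print(dtype.__name__, "dense ", mean, var)

    for fmt in (sp.csr_matrix, sp.csc_matrix):
        mean_sp, var_sp = mean_variance_axis(fmt(X), axis=0)
        print(dtype.__name__, fmt.__name__, mean_sp, var_sp)
```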
The results are the following (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1):
```
## dtype=float64
_incremental_mean_and_var [100.] [0.]
csr_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csr [100.] [-2.18040566e-11]
csc_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csc [100.] [-2.18040566e-11]

## dtype=float32
_incremental_mean_and_var [100.00000577] [3.32692735e-11]
csr_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csr [99.99997] [0.00123221]
csc_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csc [99.99997] [0.00123221]
```
So the `sklearn.utils.extmath._incremental_mean_and_var` function for dense numpy arrays is numerically stable in both float64 and float32 (~1e-11 is much smaller than `np.finfo(np.float32).eps`), but the sparse counterparts, incremental or not, are all wrong in the same way.
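The magnitude of the error (~1e-3 in float32 and ~1e-11 in float64 for a mean of ~100) looks consistent with catastrophic cancellation in a naive `E[X**2] - E[X]**2` style computation, although I have not checked that this is exactly what the Cython helpers do. A quick illustration of that cancellation, independent of the sparse code:

```python
# Illustration of the suspected cancellation (assumption: this is not
# necessarily how the Cython helpers compute the variance). With a mean
# around 100, E[X^2] and E[X]^2 are both ~1e4, so their float32 difference
# can only be resolved to about 1e4 * eps(float32) ~= 1.2e-3, which swamps
# a true variance on the order of 1e-10.
import numpy as np

rng = np.random.default_rng(0)
x = (100 + 1e-5 * rng.standard_normal(10_000)).astype(np.float32)

naive_var = np.mean(x ** 2) - np.mean(x) ** 2     # catastrophic cancellation
two_pass_var = np.mean((x - np.mean(x)) ** 2)     # centered before squaring
print(naive_var, two_pass_var)
```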
So the gist above should be adapted to write a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in `sklearn.utils.extmath._incremental_mean_and_var` to the sparse case.
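The stability of the dense helper essentially comes from centering the data before squaring and accumulating in float64. One possible direction for the sparse case (a sketch only; the function name and structure below are mine, not the actual fix) is a centered, two-pass computation per column that accounts for the implicit zeros:

```python
# Sketch of a centered (two-pass) mean/variance for one column of a CSC
# matrix that accounts for the implicit zeros; illustrative only, not the
# actual implementation.
import numpy as np
import scipy.sparse as sp


def csc_column_mean_var(X_csc, j):
    """Mean and population variance of column j of a CSC matrix."""
    n_samples = X_csc.shape[0]
    start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
    # Accumulate in float64 even for float32 input, like the dense helper.
    data = np.asarray(X_csc.data[start:end], dtype=np.float64)
    n_zeros = n_samples - data.shape[0]

    # First pass: the mean (implicit zeros contribute nothing to the sum).
    mean = data.sum() / n_samples

    # Second pass: centered sum of squares; each implicit zero contributes
    # (0 - mean) ** 2 = mean ** 2.
    sq_dev = np.sum((data - mean) ** 2) + n_zeros * mean ** 2
    return mean, sq_dev / n_samples
```

On data like the example above (values around 100 with a tiny spread), this kind of centered computation keeps the variance at the right scale because the large mean is subtracted before squaring.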
Note: there is another open issue about the numerical stability of `StandardScaler` (#5602 / #11549), but it is related to the computation of the (unweighted) mean in incremental mode (in `partial_fit`) vs full batch mode (in `fit`).