
Weighted variance computation for sparse data is not numerically stable #19546

Closed
@ogrisel

Description


This issue was discovered when adding tests for #19527 (currently marked XFAIL).

Here is a minimal reproduction case using the underlying private API:

https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267
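For readers who don't want to open the gist, here is a rough, self-contained approximation of the reproduction. It goes through the public `mean_variance_axis` wrapper rather than the private Cython helpers that the gist (and the table below) exercises directly, and the exact data and weighting in the gist may differ; the ingredient that matters is a feature with a large mean and a tiny true variance:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import _incremental_mean_and_var
from sklearn.utils.sparsefuncs import mean_variance_axis

rng = np.random.RandomState(0)
# One feature tightly clustered around 100: large mean, tiny variance.
X = 100 + rng.normal(0, 1e-3, size=(1000, 1))

for dtype in (np.float64, np.float32):
    Xd = X.astype(dtype)
    mean, var, _ = _incremental_mean_and_var(
        Xd, np.zeros(1), np.zeros(1), np.zeros(1))
    print(dtype.__name__, "dense incremental:", mean, var)
    for fmt in (sp.csr_matrix, sp.csc_matrix):
        mean, var = mean_variance_axis(fmt(Xd), axis=0)
        print(dtype.__name__, fmt.__name__ + ":", mean, var)
```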

The results are the following (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1):

```
## dtype=float64
_incremental_mean_and_var [100.] [0.]
csr_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csr [100.] [-2.18040566e-11]
csc_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csc [100.] [-2.18040566e-11]
## dtype=float32
_incremental_mean_and_var [100.00000577] [3.32692735e-11]
csr_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csr [99.99997] [0.00123221]
csc_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csc [99.99997] [0.00123221]
```

So the sklearn.utils.extmath._incremental_mean_and_var function for dense numpy arrays is numerically stable in both float64 and float32 (~1e-11 is much less than np.finfo(np.float32).eps), but the sparse counterparts, whether incremental or not, are all wrong in the same way.
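The symptoms (tiny negative variances in float64, errors many orders of magnitude above eps in float32) are characteristic of catastrophic cancellation in a one-pass `E[x^2] - E[x]^2` style computation, where two large, nearly equal quantities are subtracted. A minimal illustration of that failure mode, independent of scikit-learn:

```python
import numpy as np

x = np.array([10000.0, 10001.0, 10002.0], dtype=np.float32)

# Stable two-pass formula: centre the data before squaring.
stable = np.mean((x - np.mean(x)) ** 2)

# Naive one-pass formula: both terms are ~1e8 while their difference is
# ~0.67, so float32's ~7 significant digits vanish in the subtraction.
naive = np.mean(x ** 2) - np.mean(x) ** 2

print(stable)  # 0.6666667, the true population variance
print(naive)   # 0.0 -- the variance is lost entirely; depending on the
               # data this can even come out slightly negative, like the
               # -2.18e-11 values in the table above
```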

The gist above should therefore be adapted into a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in sklearn.utils.extmath._incremental_mean_and_var to the sparse case.
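As a possible direction, here is a hypothetical two-pass sketch for the one-shot CSC case (the function name and details are made up for illustration, not scikit-learn's eventual fix): compute the column mean first, centre the explicitly stored values, and fold the implicit zeros in analytically, since each of them lies at distance `mean` from the mean. The real fix would also need the incremental and weighted variants, but the centring idea is the same:

```python
import numpy as np
import scipy.sparse as sp

def stable_mean_var_axis0(X):
    """Hypothetical two-pass mean/variance over axis 0 of a CSC matrix.

    Centring before squaring avoids the E[x^2] - E[x]^2 cancellation;
    implicit zeros are accounted for analytically.
    """
    n_samples, n_features = X.shape
    means = np.zeros(n_features)
    variances = np.zeros(n_features)
    for j in range(n_features):
        col = X.data[X.indptr[j]:X.indptr[j + 1]]
        mean = col.sum() / n_samples      # implicit zeros add 0 to the sum
        n_zeros = n_samples - col.shape[0]
        ss = np.sum((col - mean) ** 2)    # centred explicit entries
        ss += n_zeros * mean ** 2         # each implicit zero is -mean away
        means[j] = mean
        variances[j] = ss / n_samples
    return means, variances

# Sanity check against the dense computation:
X = sp.random(1000, 5, density=0.3, format="csc", random_state=0)
means, variances = stable_mean_var_axis0(X)
assert np.allclose(means, X.toarray().mean(axis=0))
assert np.allclose(variances, X.toarray().var(axis=0))
```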

Note: there is another open issue about the numerical stability of StandardScaler (#5602 / #11549), but it relates to the computation of the (unweighted) mean in incremental mode (in partial_fit) vs full batch mode (in fit).
