Description
This issue was discovered when adding tests for #19527 (currently marked XFAIL).
Here is a minimal reproduction case using the underlying private API:
https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267
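Roughly, the gist compares the dense helper against the sparse Cython helpers on data with a large mean (~100) and a tiny variance, which is the regime where catastrophic cancellation shows up. A simplified sketch of that comparison (not the exact script from the gist; it goes through the public `mean_variance_axis` wrapper instead of calling the Cython functions directly, and the data is made up) could look like:

```python
# Simplified sketch of the comparison in the gist (illustrative only):
# data with a large mean (~100) and a tiny variance, compared through the
# dense incremental helper and the public wrapper around the sparse
# Cython helpers.
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import _incremental_mean_and_var
from sklearn.utils.sparsefuncs import mean_variance_axis

rng = np.random.default_rng(42)
for dtype in (np.float64, np.float32):
    X = (100 + 1e-5 * rng.standard_normal((10_000, 1))).astype(dtype)

    mean, var, _ = _incremental_mean_and_var(
        X,
        last_mean=np.zeros(1),
        last_variance=np.zeros(1),
        last_sample_count=np.zeros(1),
    )
    print(dtype.__name__, "dense ", mean, var)

    for fmt in (sp.csr_matrix, sp.csc_matrix):
        mean_sp, var_sp = mean_variance_axis(fmt(X), axis=0)
        print(dtype.__name__, fmt.__name__, mean_sp, var_sp)
```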
The results are the following (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1):
```
## dtype=float64
_incremental_mean_and_var [100.] [0.]
csr_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csr [100.] [-2.18040566e-11]
csc_mean_variance_axis0 [100.] [-2.18040566e-11]
incr_mean_variance_axis0 csc [100.] [-2.18040566e-11]

## dtype=float32
_incremental_mean_and_var [100.00000577] [3.32692735e-11]
csr_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csr [99.99997] [0.00123221]
csc_mean_variance_axis0 [99.99997] [0.00123221]
incr_mean_variance_axis0 csc [99.99997] [0.00123221]
```
So the `sklearn.utils.extmath._incremental_mean_and_var` function for dense numpy arrays is numerically stable in both float64 and float32 (~1e-11 is much smaller than `np.finfo(np.float32).eps`), but the sparse counterparts, incremental or not, are all wrong in the same way.
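The magnitude of the error (~1e-3 in float32 and ~1e-11 in float64 for a mean of ~100) looks consistent with catastrophic cancellation in a naive `E[X**2] - E[X]**2` style computation, although I have not checked that this is exactly what the Cython helpers do. A quick illustration of that cancellation, independent of the sparse code:

```python
# Illustration of the suspected cancellation (assumption: this is not
# necessarily how the Cython helpers compute the variance). With a mean
# around 100, E[X^2] and E[X]^2 are both ~1e4, so their float32 difference
# can only be resolved to about 1e4 * eps(float32) ~= 1.2e-3, which swamps
# a true variance on the order of 1e-10.
import numpy as np

rng = np.random.default_rng(0)
x = (100 + 1e-5 * rng.standard_normal(10_000)).astype(np.float32)

naive_var = np.mean(x ** 2) - np.mean(x) ** 2     # catastrophic cancellation
two_pass_var = np.mean((x - np.mean(x)) ** 2)     # centered before squaring
print(naive_var, two_pass_var)
```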
So the gist above should be adapted to write a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in `sklearn.utils.extmath._incremental_mean_and_var` to the sparse case.
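The stability of the dense helper essentially comes from centering the data before squaring and accumulating in float64. One possible direction for the sparse case (a sketch only; the function name and structure below are mine, not the actual fix) is a centered, two-pass computation per column that accounts for the implicit zeros:

```python
# Sketch of a centered (two-pass) mean/variance for one column of a CSC
# matrix that accounts for the implicit zeros; illustrative only, not the
# actual implementation.
import numpy as np
import scipy.sparse as sp


def csc_column_mean_var(X_csc, j):
    """Mean and population variance of column j of a CSC matrix."""
    n_samples = X_csc.shape[0]
    start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
    # Accumulate in float64 even for float32 input, like the dense helper.
    data = np.asarray(X_csc.data[start:end], dtype=np.float64)
    n_zeros = n_samples - data.shape[0]

    # First pass: the mean (implicit zeros contribute nothing to the sum).
    mean = data.sum() / n_samples

    # Second pass: centered sum of squares; each implicit zero contributes
    # (0 - mean) ** 2 = mean ** 2.
    sq_dev = np.sum((data - mean) ** 2) + n_zeros * mean ** 2
    return mean, sq_dev / n_samples
```

On data like the example above (values around 100 with a tiny spread), this kind of centered computation keeps the variance at the right scale because the large mean is subtracted before squaring.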
Note: there is another open issue about the numerical stability of `StandardScaler` (#5602 / #11549), but it is related to the computation of the (unweighted) mean in incremental mode (in `partial_fit`) vs full batch mode (in `fit`).