
MNT Avoid catastrophic cancellation in mean_variance_axis #19766


Merged
doc/whats_new/v1.0.rst (11 changes: 9 additions & 2 deletions)
@@ -288,12 +288,19 @@ Changelog
   :user:`Clifford Akai-Nettey<cliffordEmmanuel>`.
 
 :mod:`sklearn.calibration`
-............................
+..........................
 
 - |Fix| The predict and predict_proba methods of
-  :class:`calibration.CalibratedClassifierCV can now properly be used on
+  :class:`calibration.CalibratedClassifierCV` can now properly be used on
   prefitted pipelines. :pr:`19641` by :user:`Alek Lefebvre <AlekLefebvre>`
 
+:mod:`sklearn.utils`
+....................
+
+- |Fix| Fixed a bug in :func:`utils.sparsefuncs.mean_variance_axis` where the
+  precision of the computed variance was very poor when the real variance is
+  exactly zero. :pr:`19766` by :user:`Jérémie du Boisberranger <jeremiedbb>`.
+
 Code and Documentation Contributors
 -----------------------------------
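To see the failure mode this changelog entry describes: the variance correction for a column's implicit zeros multiplies the difference of two floating-point weight sums by the squared mean, and when those sums should be equal, their rounding residual gets amplified. A standalone NumPy illustration of this effect (not code from this PR; the numbers are only indicative):

    import numpy as np

    rng = np.random.RandomState(0)
    weights = rng.rand(1000).astype(np.float32)

    # Two mathematically identical totals, accumulated in different orders:
    total = np.sum(weights)      # NumPy's pairwise summation
    running = np.float32(0.0)
    for w in weights:            # sequential accumulation, as in a tight loop
        running += w

    residual = total - running   # usually a tiny nonzero rounding residual
    mean = np.float32(100.0)

    # Multiplying the residual by mean**2 turns rounding noise into a
    # visibly wrong "variance" for a column whose true variance is 0.
    print(residual, residual * mean**2)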
sklearn/utils/sparsefuncs_fast.pyx (62 changes: 42 additions & 20 deletions)
@@ -124,23 +124,32 @@ def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
     variances = np.zeros_like(means, dtype=dtype)
 
     cdef:
-        np.ndarray[floating, ndim=1] sum_weights = \
-            np.full(fill_value=np.sum(weights), shape=n_features, dtype=dtype)
-        np.ndarray[floating, ndim=1] sum_weights_nan = \
-            np.zeros(shape=n_features, dtype=dtype)
-        np.ndarray[floating, ndim=1] sum_weights_nz = \
-            np.zeros(shape=n_features, dtype=dtype)
+        np.ndarray[floating, ndim=1] sum_weights = np.full(
+            fill_value=np.sum(weights), shape=n_features, dtype=dtype)
+        np.ndarray[floating, ndim=1] sum_weights_nz = np.zeros(
+            shape=n_features, dtype=dtype)
+
--- Review thread on the `counts` declaration below ---

thomasjpfan (Member): A more explicit name:

Suggested change:
-        np.ndarray[np.uint64_t, ndim=1] counts = np.full(
+        np.ndarray[np.uint64_t, ndim=1] counts_nan = np.full(

jeremiedbb (Member, Author):
    sum_weights is the sum of all weights where X is not nan
    sum_weights_nan is the sum of weights where X is nan
    sum_weights_nz is the sum of weights where X is non-zero

Following the same scheme:
    counts is the number of elements which are not nan
    counts_nz is the number of elements which are non-zero

I'd rather keep that. Maybe you missed that the increment is negative
(counts[col_ind] -= 1). Let me try to reorder the code so that the match is
clearer (and remove sum_weights_nan; we actually need only 2 out of the 3
arrays).

thomasjpfan (Member, Mar 30, 2021): Yes, I was mistaken with the suggestion.
I was thinking about the negation of it, so the suggestion should have been
counts_non_nan. If we do not change the name, I think we should at least put
a comment above counts to say that it is the number of elements which are
not nan.

jeremiedbb (Member, Author): I added comments to describe the different
arrays.

--- End of review thread ---
+        np.ndarray[np.uint64_t, ndim=1] counts = np.full(
+            fill_value=weights.shape[0], shape=n_features, dtype=np.uint64)
+        np.ndarray[np.uint64_t, ndim=1] counts_nz = np.zeros(
+            shape=n_features, dtype=np.uint64)
 
     for row_ind in range(len(X_indptr) - 1):
         for i in range(X_indptr[row_ind], X_indptr[row_ind + 1]):
             col_ind = X_indices[i]
             if not isnan(X_data[i]):
                 means[col_ind] += (X_data[i] * weights[row_ind])
+                # sum of weights where X[:, col_ind] is non-zero
+                sum_weights_nz[col_ind] += weights[row_ind]
+                # number of non-zero elements of X[:, col_ind]
+                counts_nz[col_ind] += 1
             else:
-                sum_weights_nan[col_ind] += weights[row_ind]
+                # sum of weights where X[:, col_ind] is not nan
+                sum_weights[col_ind] -= weights[row_ind]
+                # number of non nan elements of X[:, col_ind]
+                counts[col_ind] -= 1
 
     for i in range(n_features):
-        sum_weights[i] -= sum_weights_nan[i]
         means[i] /= sum_weights[i]
 
     for row_ind in range(len(X_indptr) - 1):
@@ -149,10 +158,12 @@ def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
             if not isnan(X_data[i]):
                 diff = X_data[i] - means[col_ind]
                 variances[col_ind] += diff * diff * weights[row_ind]
-                sum_weights_nz[col_ind] += weights[row_ind]
 
     for i in range(n_features):
-        variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2
+        if counts[i] != counts_nz[i]:
+            # only compute it when it's guaranteed to be non-zero to avoid
+            # catastrophic cancellation.
+            variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2
         variances[i] /= sum_weights[i]
 
     return means, variances, sum_weights
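To make the bookkeeping in the diff above (and the naming scheme discussed in the review thread) easier to follow outside Cython, here is a rough pure-Python sketch of the fixed per-column logic. The function name and argument layout are invented for illustration; only the variable names mirror the Cython code:

    import numpy as np

    def column_mean_var(values, rows, weights, n_samples):
        # values: explicitly stored entries of one sparse column (may hold NaN)
        # rows:   row index of each stored entry; weights: one weight per row
        sum_weights = weights.sum()  # will become the total non-NaN weight
        counts = n_samples           # will become the number of non-NaN elements
        sum_weights_nz = 0.0         # total weight of the stored entries
        counts_nz = 0                # number of stored entries
        mean = 0.0

        for v, r in zip(values, rows):
            if np.isnan(v):
                sum_weights -= weights[r]   # NaNs are excluded entirely
                counts -= 1
            else:
                mean += v * weights[r]
                sum_weights_nz += weights[r]
                counts_nz += 1
        mean /= sum_weights

        var = 0.0
        for v, r in zip(values, rows):
            if not np.isnan(v):
                var += weights[r] * (v - mean) ** 2

        # The integer counts say *exactly* whether implicit zeros exist, so the
        # correction term is added only when it is truly non-zero; a rounding
        # residual in (sum_weights - sum_weights_nz) can no longer leak in.
        if counts != counts_nz:
            var += (sum_weights - sum_weights_nz) * mean ** 2
        return mean, var / sum_weights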
@@ -228,23 +239,32 @@ def _csc_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
     variances = np.zeros_like(means, dtype=dtype)
 
     cdef:
-        np.ndarray[floating, ndim=1] sum_weights = \
-            np.full(fill_value=np.sum(weights), shape=n_features, dtype=dtype)
-        np.ndarray[floating, ndim=1] sum_weights_nan = \
-            np.zeros(shape=n_features, dtype=dtype)
-        np.ndarray[floating, ndim=1] sum_weights_nz = \
-            np.zeros(shape=n_features, dtype=dtype)
+        np.ndarray[floating, ndim=1] sum_weights = np.full(
+            fill_value=np.sum(weights), shape=n_features, dtype=dtype)
+        np.ndarray[floating, ndim=1] sum_weights_nz = np.zeros(
+            shape=n_features, dtype=dtype)
+
+        np.ndarray[np.uint64_t, ndim=1] counts = np.full(
+            fill_value=weights.shape[0], shape=n_features, dtype=np.uint64)
+        np.ndarray[np.uint64_t, ndim=1] counts_nz = np.zeros(
+            shape=n_features, dtype=np.uint64)
 
     for col_ind in range(n_features):
         for i in range(X_indptr[col_ind], X_indptr[col_ind + 1]):
             row_ind = X_indices[i]
             if not isnan(X_data[i]):
                 means[col_ind] += (X_data[i] * weights[row_ind])
+                # sum of weights where X[:, col_ind] is non-zero
+                sum_weights_nz[col_ind] += weights[row_ind]
+                # number of non-zero elements of X[:, col_ind]
+                counts_nz[col_ind] += 1
             else:
-                sum_weights_nan[col_ind] += weights[row_ind]
+                # sum of weights where X[:, col_ind] is not nan
+                sum_weights[col_ind] -= weights[row_ind]
+                # number of non nan elements of X[:, col_ind]
+                counts[col_ind] -= 1
 
     for i in range(n_features):
-        sum_weights[i] -= sum_weights_nan[i]
         means[i] /= sum_weights[i]
 
     for col_ind in range(n_features):
@@ -253,10 +273,12 @@ def _csc_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
             if not isnan(X_data[i]):
                 diff = X_data[i] - means[col_ind]
                 variances[col_ind] += diff * diff * weights[row_ind]
-                sum_weights_nz[col_ind] += weights[row_ind]
 
     for i in range(n_features):
-        variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2
+        if counts[i] != counts_nz[i]:
+            # only compute it when it's guaranteed to be non-zero to avoid
+            # catastrophic cancellation.
+            variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2
         variances[i] /= sum_weights[i]
 
     return means, variances, sum_weights
sklearn/utils/tests/test_sparsefuncs.py (20 changes: 20 additions & 0 deletions)
@@ -53,6 +53,26 @@ def test_mean_variance_axis0():
     assert_array_almost_equal(X_vars, np.var(X_test, axis=0))
 
 
+@pytest.mark.parametrize("dtype", [np.float32, np.float64])
+@pytest.mark.parametrize("sparse_constructor", [sp.csr_matrix, sp.csc_matrix])
+def test_mean_variance_axis0_precision(dtype, sparse_constructor):
+    # Check that there's no big loss of precision when the real variance is
+    # exactly 0. (#19766)
+    rng = np.random.RandomState(0)
+    X = np.full(fill_value=100., shape=(1000, 1), dtype=dtype)
+    # Add some missing records which should be ignored:
+    missing_indices = rng.choice(np.arange(X.shape[0]), 10, replace=False)
+    X[missing_indices, 0] = np.nan
+    X = sparse_constructor(X)
+
+    # Random positive weights:
+    sample_weight = rng.rand(X.shape[0]).astype(dtype)
+
+    _, var = mean_variance_axis(X, weights=sample_weight, axis=0)
+
+    assert var < np.finfo(dtype).eps
+
+
 def test_mean_variance_axis1():
     X, _ = make_classification(5, 4, random_state=0)
     # Sparsify the array a little bit
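For reference, the behaviour the new test locks in can also be observed through the public helper it exercises; a small standalone usage sketch (output values are illustrative):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils.sparsefuncs import mean_variance_axis

    rng = np.random.RandomState(0)
    X = sp.csr_matrix(np.full((1000, 1), 100.0, dtype=np.float32))
    weights = rng.rand(1000).astype(np.float32)

    _, var = mean_variance_axis(X, axis=0, weights=weights)

    # The column is constant, so its true variance is exactly 0. With the
    # guarded correction term the result stays below float32 eps instead of
    # being inflated by the cancellation described in the changelog entry.
    print(var, var < np.finfo(np.float32).eps)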