
FIX Fix gram validation: dtype-aware tolerance #22059


Merged
16 commits merged into scikit-learn:main on Jun 7, 2022

Conversation

@MalteKurz (Contributor) commented on Dec 22, 2021

Reference Issues/PRs

Fixes #21997. See also #19004.

What does this implement/fix? Explain your changes.

In #19004, a validation for user-supplied Gram matrices was introduced. A core part of the check re-computes a single element of the Gram matrix and verifies that it equals the corresponding entry of the supplied matrix.

The check can cause "false alarms" because there is no guarantee that np.dot delivers numerically identical results when called with two 2d arrays (with the specific element extracted afterwards) and when called with the two 1d arrays corresponding to that element. Two examples follow:

import numpy as np
from sklearn.linear_model import ElasticNet

# Example 1
rng = np.random.RandomState(0)
X = rng.binomial(1, 0.25, (100, 2)).astype(np.float32)
y = rng.random(100).astype(np.float32)
X_c = X - np.average(X, axis=0)

# The same Gram entry computed two ways differs slightly in float32:
print(np.dot(X_c.T, X_c)[1, 1])
print(np.dot(X_c[:, 1], X_c[:, 1]))

precompute = np.dot(X_c.T, X_c)
# Without the fix, the Gram validation raises a ValueError here (false alarm):
ElasticNet(precompute=precompute).fit(X_c, y)

# Example 2
rng = np.random.RandomState(58)
X = rng.binomial(1, 0.25, (1000, 4)).astype(np.float32)
y = rng.random(1000).astype(np.float32)
X_c = X - np.average(X, axis=0)

print(np.dot(X_c.T, X_c)[2, 3])
print(np.dot(X_c[:, 2], X_c[:, 3]))

precompute = np.dot(X_c.T, X_c)
ElasticNet(precompute=precompute).fit(X_c, y)

As a consequence, the validation can even kick in when the user did not provide a pre-computed Gram matrix at all. Such a case, where a call to LassoCV().fit() internally pre-computes the Gram matrix, passes it through, and the validation then fails, is described in #21997:

# Example 3
from sklearn.linear_model import LassoCV
import numpy as np

m = LassoCV()

np.random.seed(seed=3)

X = np.random.random((10000, 50)).astype(np.float32)
X[:, 25] = np.where(X[:, 25] < 0.98, 0, 1)
X[:, 26] = np.where(X[:, 26] < 0.98, 0, 1)
y = np.random.random((10000, 1)).astype(np.float32)

m.fit(X, y)  # before the fix, the internally pre-computed Gram matrix failed validation here

The proposed fix: As suggested by @agramfort, the Gram matrix check is now done with a dtype-aware tolerance. This is achieved by calling sklearn.utils._testing.assert_allclose with rtol=None, see #22059 (comment). For this, the default rtol of _check_precomputed_gram_matrix was changed to None, and the docstring has been adapted accordingly.
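To illustrate, the dtype-aware comparison accepts the float32 discrepancy from Example 1. The following sketch assumes (per my reading of sklearn.utils._testing.assert_allclose) that rtol=None is resolved from the input dtypes, roughly 1e-4 for float32 and 1e-7 for float64:

import numpy as np
from sklearn.utils._testing import assert_allclose

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.25, (100, 2)).astype(np.float32)
X_c = X - np.average(X, axis=0)

# The two ways of computing the same Gram entry differ slightly in float32,
# but the comparison passes once rtol is resolved from the (float32) dtype.
assert_allclose(
    np.dot(X_c.T, X_c)[1, 1],
    np.dot(X_c[:, 1], X_c[:, 1]),
    rtol=None,
)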

Unit test: Based on one of the above examples, I added a new unit test that is sensitive to the described problem.
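For reference, a regression test along these lines could look roughly as follows (a sketch only; the test name is made up here and the test actually added in the PR may differ):

import numpy as np
from sklearn.linear_model import ElasticNet

def test_enet_gram_validation_float32():  # hypothetical name
    # Reproduces Example 2 above: with a dtype-aware tolerance, fitting with a
    # user-supplied float32 Gram matrix must not trigger a false alarm.
    rng = np.random.RandomState(58)
    X = rng.binomial(1, 0.25, (1000, 4)).astype(np.float32)
    y = rng.random(1000).astype(np.float32)
    X_c = X - np.average(X, axis=0)

    precompute = np.dot(X_c.T, X_c)
    ElasticNet(precompute=precompute).fit(X_c, y)  # should not raise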

Any other comments?

@MalteKurz changed the title from "Fix gram validation by always appliying np.dot to two 2d arrays" to "Fix gram validation by always applying np.dot to two 2d arrays" on Dec 22, 2021

@agramfort (Member) commented:

what makes little sense to me is to use the same default rtol and atol for float32 and float64

a solution along those lines:

diff --git a/sklearn/linear_model/_base.py b/sklearn/linear_model/_base.py
index 652556cf1e..4de396c33d 100644
--- a/sklearn/linear_model/_base.py
+++ b/sklearn/linear_model/_base.py
@@ -769,6 +769,10 @@ def _check_precomputed_gram_matrix(
     v1 = (X[:, f1] - X_offset[f1]) * X_scale[f1]
     v2 = (X[:, f2] - X_offset[f2]) * X_scale[f2]

+    if v1.itemsize == 4:  # single precision
+        rtol = np.sqrt(rtol)
+        atol = np.sqrt(atol)
+
     expected = np.dot(v1, v2)
     actual = precompute[f1, f2]

seems to fix your problem.

as it is, it would require updating the docstring etc., but that would be my approach.

@MalteKurz (Contributor, Author) commented:

(quoting @agramfort's suggestion above)

Thanks @agramfort for the feedback. I agree that your proposal is the better solution. However, I wonder whether one could just call sklearn.utils._testing.assert_allclose for the comparison; see:

def assert_allclose(
    actual, desired, rtol=None, atol=0.0, equal_nan=True, err_msg="", verbose=True
):
    """dtype-aware variant of numpy.testing.assert_allclose"""

It is already dtype-aware and, for example, also used in the module https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/estimator_checks.py.

The function _check_precomputed_gram_matrix is anyway only used in one specific place, with the default rtol and atol. In my opinion there are now two solutions based on the above-mentioned sklearn.utils._testing.assert_allclose:
A) Change the default rtol of _check_precomputed_gram_matrix to None and pass rtol and atol through to the dtype-aware assert_allclose. This would require an update of the docstring.
B) Remove the rtol and atol parameters from _check_precomputed_gram_matrix and rely completely on sklearn.utils._testing.assert_allclose with its default rtol and atol.

@agramfort (Member) commented:

no objection to using sklearn.utils._testing.assert_allclose

sorry for the slow reaction time @MalteKurz

@MalteKurz changed the title from "Fix gram validation by always applying np.dot to two 2d arrays" to "Fix gram validation: dtype-aware tolerance" on May 30, 2022
@MalteKurz (Contributor, Author) commented:

(quoting @agramfort above)

@agramfort No worries, thanks for the feedback.

I adapted the code accordingly, so that the validation is now dtype-aware. The description of this PR (#22059 (comment)) was also updated along these lines, and it should now be ready for review.

@MalteKurz marked this pull request as ready for review on May 30, 2022 09:14
@agramfort (Member) left a comment:

@MalteKurz can you add a what's new entry?

🙏 for taking a stab at this.

rtol : float, default=None
    Relative tolerance; see numpy.allclose and
    sklearn.utils._testing.assert_allclose.
    If None, it is set based on the provided arrays' dtypes.
A reviewer (Member) commented:

Please be explicit about the number. Before, it was clear that it was 1e-7.

@MalteKurz (Contributor, Author) replied:

I adapted the PR accordingly (see 9d2ae61) and now explicitly state the rtol values in the docstring.
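The updated docstring entry presumably reads along these lines (the exact wording is an assumption, not quoted from commit 9d2ae61):

rtol : float, default=None
    Relative tolerance; see numpy.allclose. If None, it is set to 1e-4 for
    arrays of dtype numpy.float32 and to 1e-7 otherwise.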

@thomasjpfan (Member) left a comment:

Thank you for the PR!

f"{expected} but the user-supplied value was "
f"{actual}."
)
assert_allclose(
A reviewer (Member) commented:

This changes the exception class raised here (previously a ValueError). Also, importing from _testing in non-testing code has the consequence of importing pytest (if pytest is installed); loading an unused module can increase the startup time of programs.

I am in favor of copying the rtol selection logic here.

@MalteKurz (Contributor, Author) replied:

I adapted the PR accordingly (see 9d2ae61):

  • prevent import from utils._testing
  • copy-paste rtol logic
  • go back to ValueError (the resulting check is sketched below)
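Putting those three points together, the resulting check could look roughly like the following (a sketch under the assumptions above; the spot-checked feature indices, the atol default, and the error message are illustrative rather than quoted from the merged diff):

import numpy as np

def _check_precomputed_gram_matrix(
    X, precompute, X_offset, X_scale, rtol=None, atol=1e-5
):
    # Spot-check a single entry of the user-supplied Gram matrix against a
    # re-computed value.
    n_features = X.shape[1]
    f1 = n_features // 2
    f2 = min(f1 + 1, n_features - 1)

    v1 = (X[:, f1] - X_offset[f1]) * X_scale[f1]
    v2 = (X[:, f2] - X_offset[f2]) * X_scale[f2]

    if rtol is None:
        # dtype-aware default, mirroring sklearn.utils._testing.assert_allclose
        # without importing it: a looser tolerance for single precision.
        rtol = 1e-4 if v1.dtype == np.float32 else 1e-7

    expected = np.dot(v1, v2)
    actual = precompute[f1, f2]
    if not np.isclose(expected, actual, rtol=rtol, atol=atol):
        raise ValueError(
            "Gram matrix passed in via 'precompute' parameter did not pass "
            f"validation: the element at index ({f1}, {f2}) was expected to "
            f"be {expected} but the user-supplied value was {actual}."
        )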

@@ -414,7 +414,6 @@ def assert_allclose(
If None, it is set based on the provided arrays' dtypes.
atol : float, optional, default=0.
Absolute tolerance.
If None, it is set based on the provided arrays' dtypes.
A reviewer (Member) commented:

Can you open a separate PR for this? This would keep this PR focused on fixing _check_precomputed_gram_matrix.

@MalteKurz (Contributor, Author) replied:

Done, see #23555.

@MalteKurz (Contributor, Author) commented on Jun 7, 2022

(quoting @agramfort's review comment above)

@agramfort A what's new entry was added in 1acfd42.

@MalteKurz requested a review from thomasjpfan on June 7, 2022 09:31
@thomasjpfan (Member) left a comment:

LGTM

@thomasjpfan changed the title from "Fix gram validation: dtype-aware tolerance" to "FIX Fix gram validation: dtype-aware tolerance" on Jun 7, 2022
@thomasjpfan merged commit 79c176d into scikit-learn:main on Jun 7, 2022
ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
Development

Successfully merging this pull request may close these issues.

validation check for precomputed gram matrix fails erroneously when using float32 data
3 participants