
FIX Fix gram validation: dtype-aware tolerance #22059


Merged
16 commits merged into scikit-learn:main on Jun 7, 2022

Conversation

@MalteKurz (Contributor) commented on Dec 22, 2021

Reference Issues/PRs

Fixes #21997. See also #19004.

What does this implement/fix? Explain your changes.

In #19004, a validation for user-supplied Gram matrices was introduced. A core part of the check re-computes a single element of the Gram matrix and verifies that it equals the corresponding entry of the supplied matrix.

The check can cause "false alarms" because there is no guarantee that np.dot delivers numerically identical results when called with two 2d arrays (with the specific element extracted afterwards) and when called with the two 1d arrays corresponding to that element. Two examples follow:

import numpy as np
from sklearn.linear_model import ElasticNet

# Example 1
rng = np.random.RandomState(0)
X = rng.binomial(1, 0.25, (100, 2)).astype(np.float32)
y = rng.random(100).astype(np.float32)
X_c = X - np.average(X, axis=0)

# The same Gram entry computed two ways differs slightly in float32:
print(np.dot(X_c.T, X_c)[1, 1])
print(np.dot(X_c[:, 1], X_c[:, 1]))

precompute = np.dot(X_c.T, X_c)
# Without the fix, the Gram validation raises a ValueError here (false alarm):
ElasticNet(precompute=precompute).fit(X_c, y)

# Example 2
rng = np.random.RandomState(58)
X = rng.binomial(1, 0.25, (1000, 4)).astype(np.float32)
y = rng.random(1000).astype(np.float32)
X_c = X - np.average(X, axis=0)

print(np.dot(X_c.T, X_c)[2, 3])
print(np.dot(X_c[:, 2], X_c[:, 3]))

precompute = np.dot(X_c.T, X_c)
ElasticNet(precompute=precompute).fit(X_c, y)

As a consequence, the validation can even kick in when the user did not provide a pre-computed Gram matrix at all. Such a case, where a call to LassoCV().fit() internally pre-computes the Gram matrix, passes it through, and the validation then fails, is described in #21997:

# Example 3
from sklearn.linear_model import LassoCV
import numpy as np

m = LassoCV()

np.random.seed(seed=3)

X = np.random.random((10000, 50)).astype(np.float32)
X[:, 25] = np.where(X[:, 25] < 0.98, 0, 1)
X[:, 26] = np.where(X[:, 26] < 0.98, 0, 1)
y = np.random.random((10000, 1)).astype(np.float32)

m.fit(X, y)  # before the fix, the internally pre-computed Gram matrix failed validation here

The proposed fix: As suggested by @agramfort, the Gram matrix check is now done with a dtype-aware tolerance. This is achieved by calling sklearn.utils._testing.assert_allclose with rtol=None, see #22059 (comment). For this, the default rtol of _check_precomputed_gram_matrix was changed to None, and the docstring has been adapted accordingly.
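To illustrate, the dtype-aware comparison accepts the float32 discrepancy from Example 1. The following sketch assumes (per my reading of sklearn.utils._testing.assert_allclose) that rtol=None is resolved from the input dtypes, roughly 1e-4 for float32 and 1e-7 for float64:

import numpy as np
from sklearn.utils._testing import assert_allclose

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.25, (100, 2)).astype(np.float32)
X_c = X - np.average(X, axis=0)

# The two ways of computing the same Gram entry differ slightly in float32,
# but the comparison passes once rtol is resolved from the (float32) dtype.
assert_allclose(
    np.dot(X_c.T, X_c)[1, 1],
    np.dot(X_c[:, 1], X_c[:, 1]),
    rtol=None,
)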

Unit test: Based on one of the above examples, I added a new unit test that is sensitive to the described problem.
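For reference, a regression test along these lines could look roughly as follows (a sketch only; the test name is made up here and the test actually added in the PR may differ):

import numpy as np
from sklearn.linear_model import ElasticNet

def test_enet_gram_validation_float32():  # hypothetical name
    # Reproduces Example 2 above: with a dtype-aware tolerance, fitting with a
    # user-supplied float32 Gram matrix must not trigger a false alarm.
    rng = np.random.RandomState(58)
    X = rng.binomial(1, 0.25, (1000, 4)).astype(np.float32)
    y = rng.random(1000).astype(np.float32)
    X_c = X - np.average(X, axis=0)

    precompute = np.dot(X_c.T, X_c)
    ElasticNet(precompute=precompute).fit(X_c, y)  # should not raise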

Any other comments?

@MalteKurz changed the title from "Fix gram validation by always appliying np.dot to two 2d arrays" to "Fix gram validation by always applying np.dot to two 2d arrays" on Dec 22, 2021

@agramfort (Member) commented:

what makes little sense to me is to use the same default rtol and atol for float32 and float64

a solution along those lines:

diff --git a/sklearn/linear_model/_base.py b/sklearn/linear_model/_base.py
index 652556cf1e..4de396c33d 100644
--- a/sklearn/linear_model/_base.py
+++ b/sklearn/linear_model/_base.py
@@ -769,6 +769,10 @@ def _check_precomputed_gram_matrix(
     v1 = (X[:, f1] - X_offset[f1]) * X_scale[f1]
     v2 = (X[:, f2] - X_offset[f2]) * X_scale[f2]

+    if v1.itemsize == 4:  # single precision
+        rtol = np.sqrt(rtol)
+        atol = np.sqrt(atol)
+
     expected = np.dot(v1, v2)
     actual = precompute[f1, f2]

seems to fix your problem.

as it is, it would require updating the docstring etc., but that would be my approach.

@MalteKurz (Contributor, Author) commented:

(quoting @agramfort's suggestion above)

Thanks @agramfort for the feedback. I agree that your proposal is the better solution. However, I wonder whether one could just call sklearn.utils._testing.assert_allclose for the comparison; see:

def assert_allclose(
    actual, desired, rtol=None, atol=0.0, equal_nan=True, err_msg="", verbose=True
):
    """dtype-aware variant of numpy.testing.assert_allclose"""

It is already dtype-aware and, for example, also used in the module https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/estimator_checks.py.

The function _check_precomputed_gram_matrix is anyway only used in one specific place, with the default rtol and atol. In my opinion there are now two solutions based on the above-mentioned sklearn.utils._testing.assert_allclose:
A) Change the default rtol of _check_precomputed_gram_matrix to None and pass rtol and atol through to the dtype-aware assert_allclose. This would require an update of the docstring.
B) Remove the rtol and atol parameters from _check_precomputed_gram_matrix and rely completely on sklearn.utils._testing.assert_allclose with its default rtol and atol.

@agramfort (Member) commented:

no objection to using sklearn.utils._testing.assert_allclose

sorry for the slow reaction time @MalteKurz

@MalteKurz changed the title from "Fix gram validation by always applying np.dot to two 2d arrays" to "Fix gram validation: dtype-aware tolerance" on May 30, 2022
@MalteKurz (Contributor, Author) commented:

(quoting @agramfort above)

@agramfort No worries, thanks for the feedback.

I adapted the code accordingly, so that the validation is now dtype-aware. The description of this PR (#22059 (comment)) was also updated along these lines, and it should now be ready for review.

@MalteKurz marked this pull request as ready for review on May 30, 2022 09:14
@agramfort (Member) left a comment:

@MalteKurz can you add a what's new entry?

🙏 for taking a stab at this.

rtol : float, default=None
    Relative tolerance; see numpy.allclose and
    sklearn.utils._testing.assert_allclose.
    If None, it is set based on the provided arrays' dtypes.
A reviewer (Member) commented:

Please be explicit about the number. Before, it was clear that it was 1e-7.

@MalteKurz (Contributor, Author) replied:

I adapted the PR accordingly (see 9d2ae61) and now explicitly state the rtol values in the docstring.
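The updated docstring entry presumably reads along these lines (the exact wording is an assumption, not quoted from commit 9d2ae61):

rtol : float, default=None
    Relative tolerance; see numpy.allclose. If None, it is set to 1e-4 for
    arrays of dtype numpy.float32 and to 1e-7 otherwise.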

@thomasjpfan (Member) left a comment:

Thank you for the PR!

f"{expected} but the user-supplied value was "
f"{actual}."
)
assert_allclose(
A reviewer (Member) commented:

This changes the exception class raised here (previously a ValueError). Also, importing from _testing in non-testing code has the consequence of importing pytest (if pytest is installed); loading an unused module can increase the startup time of programs.

I am in favor of copying the rtol selection logic here.

@MalteKurz (Contributor, Author) replied:

I adapted the PR accordingly (see 9d2ae61):

  • prevent import from utils._testing
  • copy-paste rtol logic
  • go back to ValueError (the resulting check is sketched below)
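Putting those three points together, the resulting check could look roughly like the following (a sketch under the assumptions above; the spot-checked feature indices, the atol default, and the error message are illustrative rather than quoted from the merged diff):

import numpy as np

def _check_precomputed_gram_matrix(
    X, precompute, X_offset, X_scale, rtol=None, atol=1e-5
):
    # Spot-check a single entry of the user-supplied Gram matrix against a
    # re-computed value.
    n_features = X.shape[1]
    f1 = n_features // 2
    f2 = min(f1 + 1, n_features - 1)

    v1 = (X[:, f1] - X_offset[f1]) * X_scale[f1]
    v2 = (X[:, f2] - X_offset[f2]) * X_scale[f2]

    if rtol is None:
        # dtype-aware default, mirroring sklearn.utils._testing.assert_allclose
        # without importing it: a looser tolerance for single precision.
        rtol = 1e-4 if v1.dtype == np.float32 else 1e-7

    expected = np.dot(v1, v2)
    actual = precompute[f1, f2]
    if not np.isclose(expected, actual, rtol=rtol, atol=atol):
        raise ValueError(
            "Gram matrix passed in via 'precompute' parameter did not pass "
            f"validation: the element at index ({f1}, {f2}) was expected to "
            f"be {expected} but the user-supplied value was {actual}."
        )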

@@ -414,7 +414,6 @@ def assert_allclose(
If None, it is set based on the provided arrays' dtypes.
atol : float, optional, default=0.
Absolute tolerance.
If None, it is set based on the provided arrays' dtypes.
A reviewer (Member) commented:

Can you open a separate PR for this? This would keep this PR focused on fixing _check_precomputed_gram_matrix.

@MalteKurz (Contributor, Author) replied:

Done, see #23555.

@MalteKurz (Contributor, Author) commented on Jun 7, 2022

(quoting @agramfort's review comment above)

@agramfort A what's new entry was added in 1acfd42.

@MalteKurz requested a review from thomasjpfan on June 7, 2022 09:31
@thomasjpfan (Member) left a comment:

LGTM

@thomasjpfan changed the title from "Fix gram validation: dtype-aware tolerance" to "FIX Fix gram validation: dtype-aware tolerance" on Jun 7, 2022
@thomasjpfan merged commit 79c176d into scikit-learn:main on Jun 7, 2022
ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
Development

Successfully merging this pull request may close these issues.

validation check for precomputed gram matrix fails erroneously when using float32 data
3 participants