TST use global_random_seed in sklearn/_loss/tests/test_link.py #23751


Merged

Conversation

marenwestermann
Member

Reference Issues/PRs

towards #22827

What does this implement/fix? Explain your changes.

I used global_random_seed in the functions test_link_inverse_identity and test_link_out_argument. With respect to both tests, this caused a number of test failures, all related to the default value of rtol in assert_allclose. I increased this value in both tests which made all tests pass. Please check if the increased tolerance is acceptable.

Any other comments?

ping @lorentzenchr

test_link_inverse_identity
test_link_out_argument

@lorentzenchr lorentzenchr left a comment


@marenwestermann Thanks for working on this.

@@ -73,15 +73,15 @@ def test_link_inverse_identity(link):
         # we do not need to distinguish.
         raw_prediction = rng.normal(loc=0, scale=10, size=(n_samples))
 
-    assert_allclose(link.link(link.inverse(raw_prediction)), raw_prediction)
+    assert_allclose(link.link(link.inverse(raw_prediction)), raw_prediction, rtol=1e-2)
Member


This rtol is a bit high. Is it specific to a certain link function?

Member Author


Agreed. Good point regarding the link function. I checked and all test failures (in both tests) are related to the LogitLink link function. Should it therefore maybe be tested separately?

Member


I think raw_prediction = rng.normal(loc=0, scale=10) sometimes creates very large (or very negative) values. The logit link then takes the exponential and the logarithm, which leads to a loss of precision.

We can use different random numbers, how about rng.uniform(-30, 30)?

Member


I think this is worth trying. We should also document this behaviour with a comment and explain why we make this choice. This is not a trivial thing.

Member Author


I tried your suggestion but it results in the same test failures as with rng.normal, i.e. all test failures are related to the LogitLink function and they occur with the same random seeds.

Member


Let's go with rng.uniform(-20, 20). The reason is that for the logit link

from scipy.special import expit, logit

x = 20
logit(expit(x))

the term expit(x) comes very close to 1 for large positive x and therefore loses precision. For large negative values, expit(x) is close to zero and therefore does not lose precision.
With float64, logit(expit(20)) is correct to about rtol 2e-9.
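To illustrate this point, here is a minimal sketch (using scipy's expit/logit directly rather than scikit-learn's LogitLink) that measures the round-trip relative error for increasing x:

```python
import numpy as np
from scipy.special import expit, logit

# Round-trip relative error of logit(expit(x)) for increasing x.
# expit(x) approaches 1.0 as x grows, so float64 loses precision
# in the 1 - expit(x) term that logit effectively needs.
for x in (5.0, 20.0, 30.0):
    rel_err = abs(logit(expit(x)) - x) / abs(x)
    print(f"x = {x:5.1f}  relative error = {rel_err:.2e}")
```

For moderate x the round trip is accurate to near machine precision, while the error grows by orders of magnitude as x increases, consistent with the rtol 2e-9 figure quoted above for x = 20.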

@glemaitre glemaitre self-requested a review June 28, 2022 09:10
@glemaitre glemaitre removed their request for review June 28, 2022 09:50
@marenwestermann
Member Author

I found a solution by using global_random_seed % 10 and setting rtol=1e-06 in both tests. I'm aware that this is not an ideal solution.

@marenwestermann
Member Author

Also, to help debug the issue, here's a minimal reproducible example of one of the failing tests:

import numpy as np
import pandas as pd
from numpy.testing import assert_allclose
from sklearn._loss.link import (
    _LINKS,
    MultinomialLogit,
)

rng = np.random.RandomState(8)
link = _LINKS['logit']()
n_samples, n_classes = 100, None
if link.is_multiclass:
    n_classes = 10
    raw_prediction = rng.normal(loc=0, scale=10, size=(n_samples, n_classes))
    if isinstance(link, MultinomialLogit):
        raw_prediction = link.symmetrize_raw_prediction(raw_prediction)
else:
    # So far, the valid interval of raw_prediction is (-inf, inf) and
    # we do not need to distinguish.
    raw_prediction = rng.normal(loc=0, scale=10, size=(n_samples))

print(f"stats: {pd.DataFrame(raw_prediction).describe()}")
assert_allclose(link.link(link.inverse(raw_prediction)), raw_prediction)

which fails with:

stats:                 0
count  100.000000
mean     0.616100
std     11.310588
min    -31.349198
25%     -6.079379
50%      0.878574
75%      7.340690
max     25.450872
Traceback (most recent call last):
  File "/Users/maren/tmp/1.py", line 23, in <module>
    assert_allclose(link.link(link.inverse(raw_prediction)), raw_prediction)
  File "/Users/maren/mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/Users/maren/mambaforge/envs/sklearn-dev/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 3 / 100 (3%)
Max absolute difference: 4.60124204e-06
Max relative difference: 1.85067717e-07
 x: array([ 9.120472e-01,  1.091283e+01, -1.946970e+01, -1.386350e+01,
       -2.296492e+01,  2.409834e+01,  1.727836e+01,  2.204556e+01,
        7.948276e+00,  9.764211e+00, -1.183427e+01,  1.916364e+01,...
 y: array([ 9.120472e-01,  1.091283e+01, -1.946970e+01, -1.386350e+01,
       -2.296492e+01,  2.409834e+01,  1.727836e+01,  2.204556e+01,
        7.948276e+00,  9.764211e+00, -1.183427e+01,  1.916364e+01,...

The maximum number of mismatched elements in the failing tests is 3 (as shown in this example).
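For comparison, a quick sketch of the bounded-range suggestion from the review, assuming LogitLink's link/inverse correspond to scipy's logit/expit: drawing raw predictions from uniform(-20, 20) with the same failing seed keeps the round trip well within a modest tolerance.

```python
import numpy as np
from numpy.testing import assert_allclose
from scipy.special import expit, logit

# Same seed as the failing reproduction above, but using the bounded
# uniform draw suggested in the review instead of rng.normal.
rng = np.random.RandomState(8)
raw_prediction = rng.uniform(low=-20, high=20, size=100)

# On (-20, 20), expit never rounds all the way to 1.0, so the logit
# round trip stays accurate; atol guards samples very close to 0.
assert_allclose(logit(expit(raw_prediction)), raw_prediction, rtol=1e-8, atol=1e-12)
print("round trip within rtol=1e-8 for all 100 samples")
```

The rtol=1e-8 here is an illustrative choice with some margin over the ~2e-9 worst case at |x| = 20, not the tolerance used in the merged test.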


@lorentzenchr lorentzenchr left a comment


Almost there 😉

@glemaitre glemaitre merged commit 96b91d7 into scikit-learn:main Jun 30, 2022
@glemaitre
Member

Thanks @marenwestermann LGTM.

@marenwestermann
Member Author

Thanks a lot for the reviews, I learned a lot! :)

@lorentzenchr
Member

Thanks @marenwestermann 🙏

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
…t-learn#23751)

Co-authored-by: Maren Westermann <maren.westermann@free-now.com>
@marenwestermann marenwestermann deleted the test-link-global-random-seed branch March 7, 2023 20:27

3 participants