
TST use global_random_seed in sklearn/decomposition/tests/test_pca.py #26403

Open
wants to merge 30 commits into main

Conversation

@sply88 sply88 commented May 19, 2023

Towards #22827

I modified the relevant tests in sklearn/decomposition/tests/test_pca.py to use the global_random_seed fixture.
Tests that relate to shapes, dtypes, exceptions and warnings were not modified.

All tests run via

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py

pass locally.

While most tests only required minimal adjustments, some required slight changes to the datasets or the test parametrization. I added comments on the respective code sections below.

The PR only changes a test file, so no addition to change log required?

sply88 added 20 commits May 19, 2023 14:31
test_pca_explained_variance_equivalence_solver uses global_random_seed fixture
test_pca_explained_variance_empirical uses global_random_seed fixture
test_pca_singular_values_consistency uses global_random_seed fixture
test_pca_singular_values uses global_random_seed fixture
test_pca_check_projection uses global_random_seed fixture
test_pca_check_projection_list uses global_random_seed fixture
test_n_components_mle uses global_random_seed fixture
test_pca_dim uses global_random_seed fixture
test_infer_dim_1 uses global_random_seed fixture
test_infer_dim_2 uses global_random_seed fixture
test_infer_dim_3 uses global_random_seed fixture
test_pca_score uses global_random_seed fixture
Random state is passed to PCA constructor, as test parametrization includes randomized svd solver
test_pca_sanity_noise_variance uses global_random_seed fixture
test_pca_score_consistency_solvers uses global_random_seed fixture
test_pca_deterministic_output uses global_random_seed fixture
test_mle_redundant_data uses global_random_seed fixture
test_mle_simple_case uses global_random_seed fixture
test_pca_randomized_svd_n_oversamples uses global_random_seed fixture
test_variance_correctness uses global_random_seed fixture
pca = PCA(n_components=2, svd_solver=svd_solver, random_state=0)
def test_pca_explained_variance_empirical(make_X, svd_solver, global_random_seed):
# parametrized factory make_X and global_random_seed determine dataset X
X = make_X(global_random_seed)
sply88 (Contributor Author):

Instead of using X directly in pytest.mark.parametrize, I use the factory function make_X to inject global_random_seed during dataset creation. This is done because pytest only resolves fixtures inside the test function, not in decorator arguments.
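A minimal sketch of the pattern (illustrative test name, and only one factory entry shown):

```python
import numpy as np
import pytest


@pytest.mark.parametrize(
    "make_X",
    [
        # each entry is a factory taking a seed, so the dataset can depend on
        # the global_random_seed fixture, which is only resolved in the test body
        lambda seed: np.random.RandomState(seed).randn(100, 60),
    ],
)
def test_explained_variance_sketch(make_X, global_random_seed):
    X = make_X(global_random_seed)  # fixture value injected at call time
    ...
```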

[
np.random.RandomState(0).randn(100, 80),
datasets.make_classification(100, 80, n_informative=78, random_state=0)[0],
lambda seed: np.random.RandomState(seed).randn(100, 60),
sply88 (Contributor Author) commented May 19, 2023:

The last assertion in this test makes sure that explained_variance_ matches the n_components largest eigenvalues of the covariance matrix of X.
For two values of global_random_seed (17 and 37) this failed with the random-data dataset and the randomized solver. Since the randomized SVD is an approximation, a little seed-sensitivity is not surprising when comparing it to the exact spectrum of the covariance matrix.
Instead of loosening the tolerance of the assertion, I slightly reduced the number of features for the random-data setting from 80 to 60.
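For reference, the assertion being discussed compares PCA's explained variances against the top eigenvalues of the sample covariance matrix, roughly like this (a sketch with an illustrative seed and tolerance, not the exact test code):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)  # stands in for global_random_seed
X = rng.randn(100, 60)          # the reduced random-data setting

pca = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(X)

# the n_components largest eigenvalues of the covariance matrix of X should
# match the explained variances; the tolerance is loose here because the
# randomized solver only approximates the exact spectrum
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]
assert_allclose(pca.explained_variance_, cov_eigenvalues[:2], rtol=5e-2)
```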

rng = np.random.RandomState(0)
n, p = 100, 5
rng = np.random.RandomState(global_random_seed)
n, p = 250, 5
sply88 (Contributor Author):

Merely adding the global_random_seed fixture showed that this test was seed-sensitive. Since the MLE cannot be expected to recover the "true" number of components for every dataset, this is not surprising.
Increasing the number of samples makes the test seed-insensitive in the 0-99 range.
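For context, the MLE-based selection of n_components works roughly as follows; the dataset below is a made-up illustration, not the one used in the test:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)  # illustrative seed
n, p = 250, 5

# rank-3 signal embedded in 5 features, plus weak isotropic noise
X = rng.randn(n, 3) @ rng.randn(3, p) + 0.1 * rng.randn(n, p)

pca = PCA(n_components="mle", svd_solver="full").fit(X)
# Minka's MLE infers the number of "real" components; with enough samples
# this should typically recover 3, but it is not guaranteed for every seed
print(pca.n_components_)
```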

# Check the consistency of score between solvers
X, _ = datasets.load_digits(return_X_y=True)
pca_full = PCA(n_components=30, svd_solver="full", random_state=0)
pca_other = PCA(n_components=30, svd_solver=svd_solver, random_state=0)
pca_full = PCA(n_components=10, svd_solver="full", random_state=global_random_seed)
sply88 (Contributor Author):

The test was seed-sensitive for the given assertion tolerance when svd_solver="randomized" was used. Again, I think this is to be expected when comparing the randomized solution to the exact solution.
Comparing only the first 10 components (instead of the first 30) mitigates this somewhat, and the test becomes seed-insensitive in the 0-99 range.
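Schematically, the comparison is between the average log-likelihoods returned by score() for the exact solver and the parametrized solver (a sketch with illustrative values, not the exact test code):

```python
from numpy.testing import assert_allclose
from sklearn import datasets
from sklearn.decomposition import PCA

X, _ = datasets.load_digits(return_X_y=True)
seed = 0  # stands in for global_random_seed

pca_full = PCA(n_components=10, svd_solver="full", random_state=seed).fit(X)
pca_other = PCA(n_components=10, svd_solver="randomized", random_state=seed).fit(X)

# with fewer components, the approximate solvers track the exact solution
# closely enough for the scores to agree across the whole seed range
assert_allclose(pca_full.score(X), pca_other.score(X), rtol=5e-3)
```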

test_pca_inverse uses global_random_seed fixture
Y = pca.transform(X)
Y_inverse = pca.inverse_transform(Y)
assert_allclose(X, Y_inverse, rtol=5e-6)
assert_allclose(X, Y_inverse, rtol=5e-5)
sply88 (Contributor Author):

The assertion with the strict rtol=5e-6 made the test highly seed-sensitive: before changing the tolerance, 486 of 600 test runs failed after including the global_random_seed fixture.
In my opinion, rtol=5e-5 still asserts good recovery of X; we cannot expect perfect recovery, even though the middle column of X has very low variance.

sply88 added 2 commits May 21, 2023 06:50
test_whitening uses global_random_seed fixture
test_whitening initializes RandomState with global_random_seed
@@ -72,7 +72,7 @@ def test_whitening(solver, copy):
assert X.shape == (n_samples, n_features)

# the component-wise variance is thus highly varying:
assert X.std(axis=0).std() > 43.8
assert X.std(axis=0).std() > 40
sply88 (Contributor Author):

The previous hard-coded threshold of 43.8 seemed to be tailored to X generated with seed 0. I reduced it a bit so the check passes for all seeds in the 0-99 range; the new threshold of 40 still confirms highly varying variances.

).fit(X_.copy())
X_unwhitened = pca.transform(X_)
assert X_unwhitened.shape == (n_samples, n_components)

# in that case the output components still have varying variances
assert X_unwhitened.std(axis=0).std() == pytest.approx(74.1, rel=1e-1)
assert X_unwhitened.std(axis=0).std() > 69
sply88 (Contributor Author):

Again, the previous comparison was probably tailored to the seed-0 dataset. I lowered the threshold a bit so it passes for all seeds in the 0-99 range, and switched to a greater-than check instead of approximate equality, since the intent is to confirm varying variances.

iterated_power=7,
n_oversamples=13,
sply88 (Contributor Author):

After adjusting the hard-coded assertion thresholds in this test, a few runs with the randomized solver still failed assert_allclose(X_whitened, X_whitened2, rtol=5e-4). Increasing n_oversamples from the default of 10 to 13 fixed this without loosening the assertion tolerance.
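The change boils down to how the randomized-solver PCA is constructed; roughly (n_components and random_state are illustrative here):

```python
from sklearn.decomposition import PCA

# raising n_oversamples above its default of 10 improves the accuracy of the
# randomized range finder, so the whitened output of the randomized solver
# matches the exact solver's output within the existing rtol=5e-4
pca_randomized = PCA(
    n_components=30,
    whiten=True,
    svd_solver="randomized",
    iterated_power=7,
    n_oversamples=13,
    random_state=0,  # stands in for global_random_seed
)
```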

test_pca_score3 uses global_random_seed fixture
pca.fit(X)
ll2 = pca.score(X)
assert ll1 > ll2


def test_pca_score3():
def test_pca_score3(global_random_seed):
sply88 (Contributor Author):

Just including the global_random_seed fixture made the test highly seed-sensitive: for 24 out of 100 seeds, the test score was highest for the ("wrong") model with two components.
I believe the test is supposed to address the following point: a possible strategy for selecting an appropriate number of components is to compare the likelihood on held-out data, e.g. in a cross-validation scheme as illustrated in this example. However, we cannot expect the test likelihood of the "true" model to be maximal for every (isolated) pair of train and test datasets, only on average. So I changed the test to run 10 trials with different train-test pairs for each n_components and accumulate the test scores.
To keep the test fast, I also reduced the sample size: the original sample size of 200 required 15 trials to make the test seed-insensitive in the 0-99 range, whereas with 50 samples only 10 trials are needed.
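Schematically, the reworked test accumulates held-out scores over several train/test pairs before comparing models; the data generator below is an assumed stand-in for the one used in the test:

```python
import numpy as np
from sklearn.decomposition import PCA


def test_pca_score3_sketch(global_random_seed):
    rng = np.random.RandomState(global_random_seed)
    n, p, n_trials = 50, 3, 10
    ll = np.zeros(p)
    for _ in range(n_trials):
        # rank-1 signal plus isotropic noise, so the "true" model has 1 component
        Xl = rng.randn(n, p) + rng.randn(n, 1) * np.array([3, 4, 5])
        Xt = rng.randn(n, p) + rng.randn(n, 1) * np.array([3, 4, 5])
        for k in range(p):
            pca = PCA(n_components=k, svd_solver="full").fit(Xl)
            ll[k] += pca.score(Xt)  # accumulate the held-out log-likelihood

    # averaged over trials, the held-out likelihood should peak at the true model
    assert ll.argmax() == 1
```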

@sply88 sply88 marked this pull request as ready for review May 22, 2023 17:35
@sply88 sply88 changed the title TST use global_random_seed in sklearn/decomposition/tests/test_pca.py [MRG] TST use global_random_seed in sklearn/decomposition/tests/test_pca.py Jul 1, 2023

github-actions bot commented Jul 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4162786.

@sply88 sply88 changed the title [MRG] TST use global_random_seed in sklearn/decomposition/tests/test_pca.py TST use global_random_seed in sklearn/decomposition/tests/test_pca.py Jul 4, 2024
sply88 commented Oct 24, 2024

@marenwestermann, thanks for labelling this.

Since I opened this PR there have been new changes in the affected test file on main. These are included now.

I was also wondering if it would make sense to split this PR up a bit? E.g. I could create one PR for all the trivial changes and individual PRs for the test functions that required adaptations to make them seed-insensitive, since the latter will need more careful consideration during review.
Also happy to leave this as it is, but figured this approach might speed up integration.
