TST use global_random_seed in sklearn/decomposition/tests/test_pca.py #26403
Conversation
test_pca_explained_variance_equivalence_solver uses global_random_seed fixture
test_pca_explained_variance_empirical uses global_random_seed fixture
test_pca_singular_values_consistency uses global_random_seed fixture
test_pca_singular_values uses global_random_seed fixture
test_pca_check_projection uses global_random_seed fixture
test_pca_check_projection_list uses global_random_seed fixture
test_n_components_mle uses global_random_seed fixture
test_pca_dim uses global_random_seed fixture
test_infer_dim_1 uses global_random_seed fixture
test_infer_dim_2 uses global_random_seed fixture
test_infer_dim_3 uses global_random_seed fixture
test_pca_score uses global_random_seed fixture
A random state is passed to the PCA constructor, as the test parametrization includes the randomized SVD solver
test_pca_sanity_noise_variance uses global_random_seed fixture
test_pca_score_consistency_solvers uses global_random_seed fixture
test_pca_deterministic_output uses global_random_seed fixture
test_mle_redundant_data uses global_random_seed fixture
test_mle_simple_case uses global_random_seed fixture
test_pca_randomized_svd_n_oversamples uses global_random_seed fixture
test_variance_correctness uses global_random_seed fixture
pca = PCA(n_components=2, svd_solver=svd_solver, random_state=0)

def test_pca_explained_variance_empirical(make_X, svd_solver, global_random_seed):
    # parametrized factory make_X and global_random_seed determine dataset X
    X = make_X(global_random_seed)
Instead of directly using X in pytest.mark.parametrize, I use the factory function make_X to inject global_random_seed during dataset creation. This is done because pytest only resolves fixtures within the test function, not in the decorator.
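For context, here is a minimal, hypothetical sketch of that pattern (the factory list, dataset shapes, and test name are simplified stand-ins rather than the actual test code; the global_random_seed fixture is assumed to be provided by scikit-learn's test configuration):

```python
import numpy as np
import pytest

from sklearn.decomposition import PCA

# Hypothetical dataset factories: each one builds X from a seed, so the value
# of the global_random_seed fixture can be injected when the test body runs.
DATASET_FACTORIES = [
    lambda seed: np.random.RandomState(seed).randn(100, 60),
    lambda seed: np.random.RandomState(seed).randn(100, 30)
    @ np.random.RandomState(seed + 1).randn(30, 60),
]


@pytest.mark.parametrize("make_X", DATASET_FACTORIES)
@pytest.mark.parametrize("svd_solver", ["full", "arpack", "randomized"])
def test_explained_variance_sketch(make_X, svd_solver, global_random_seed):
    # Fixtures are resolved only inside the test body, not in the decorator,
    # hence the indirection through the make_X factory.
    X = make_X(global_random_seed)
    pca = PCA(n_components=2, svd_solver=svd_solver, random_state=0).fit(X)
    assert pca.explained_variance_.shape == (2,)
```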
[
    np.random.RandomState(0).randn(100, 80),
    datasets.make_classification(100, 80, n_informative=78, random_state=0)[0],
    lambda seed: np.random.RandomState(seed).randn(100, 60),
The last assertion in this test makes sure that explained_variance_ is the same as the n_components largest eigenvalues of the covariance matrix of X. For two values of global_random_seed (17 and 37) this failed with dataset random-data and solver randomized. Since the randomized SVD is an approximation, a little seed sensitivity is not surprising when comparing it to the exact spectrum of the covariance matrix. Instead of adjusting the tolerance of the assertion, I slightly reduced the number of features for the random-data setting from 80 to 60.
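As a rough illustration of the check being discussed (data, solver settings, and tolerance here are illustrative, not the exact test values), the comparison between explained_variance_ and the leading eigenvalues of the empirical covariance looks roughly like this:

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 60)  # fewer features keep the randomized solver accurate

pca = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(X)

# Eigenvalues of the empirical covariance matrix, largest first
# (np.cov uses the same n - 1 normalization as explained_variance_).
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# The top n_components eigenvalues should match the explained variances,
# up to the approximation error of the randomized SVD.
assert_allclose(pca.explained_variance_, eigenvalues[:2], rtol=1e-2)
```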
rng = np.random.RandomState(0)
n, p = 100, 5
rng = np.random.RandomState(global_random_seed)
n, p = 250, 5
Only adding the global_random_seed fixture showed that this test was seed-sensitive. Since the MLE cannot be expected to recover the "true" number of components for every dataset, this is not surprising. Increasing the number of samples makes the test seed-insensitive in the 0-99 range.
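A simplified, hypothetical version of such an MLE check (dimensions, offsets, and data construction are illustrative only):

```python
import numpy as np

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n, p = 250, 5  # more samples stabilize the MLE dimensionality estimate

# Mostly isotropic noise plus a single strong direction added to a subset of
# samples, so the "expected" dimensionality is 1 (illustrative construction).
X = rng.randn(n, p) * 0.1
X[:10] += np.array([3.0, 4.0, 5.0, 1.0, 2.0])

pca = PCA(n_components="mle", svd_solver="full").fit(X)
print(pca.n_components_)  # typically 1, but the MLE is not guaranteed to hit it
```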
# Check the consistency of score between solvers
X, _ = datasets.load_digits(return_X_y=True)
pca_full = PCA(n_components=30, svd_solver="full", random_state=0)
pca_other = PCA(n_components=30, svd_solver=svd_solver, random_state=0)
pca_full = PCA(n_components=10, svd_solver="full", random_state=global_random_seed)
The test was seed-sensitive at the given assertion tolerance when svd_solver="randomized" was used. Again, I think this is to be expected when comparing the randomized solution to the exact solution. Comparing only the first 10 components (instead of the first 30) mitigates this somewhat, and the test becomes seed-insensitive in the 0-99 range.
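The comparison in question is roughly along these lines (a condensed sketch, not the exact test code):

```python
from numpy.testing import assert_allclose

from sklearn import datasets
from sklearn.decomposition import PCA

X, _ = datasets.load_digits(return_X_y=True)

# Fewer components accumulate less approximation error for the randomized
# solver, so the log-likelihood scores stay close to the exact ones.
pca_full = PCA(n_components=10, svd_solver="full", random_state=0).fit(X)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

assert_allclose(pca_full.score(X), pca_rand.score(X), rtol=5e-3)
```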
test_pca_inverse uses global_random_seed fixture
Y = pca.transform(X)
Y_inverse = pca.inverse_transform(Y)
assert_allclose(X, Y_inverse, rtol=5e-6)
assert_allclose(X, Y_inverse, rtol=5e-5)
The assertion with the strict rtol=5e-6 made the test highly seed-sensitive: before changing the tolerance, 486 of 600 tests failed after including the global_random_seed fixture. In my opinion, rtol=5e-5 still asserts good recovery of X, since we cannot expect perfect recovery even though the middle column of X has very low variance.
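For reference, the round trip being tested looks roughly like this (data construction simplified; the tolerance matches the new value):

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X[:, 1] *= 1e-5   # the middle column has very low variance
X += [5, 4, 3]    # add a non-trivial mean

# Keeping only 2 of 3 components drops the weakest direction, so the
# reconstruction cannot be exact, only very close.
pca = PCA(n_components=2, svd_solver="full").fit(X)
Y = pca.transform(X)
Y_inverse = pca.inverse_transform(Y)
assert_allclose(X, Y_inverse, rtol=5e-5)
```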
test_whitening uses global_random_seed fixture
test_whitening initializes RandomState with global_random_seed
@@ -72,7 +72,7 @@ def test_whitening(solver, copy):
    assert X.shape == (n_samples, n_features)

    # the component-wise variance is thus highly varying:
    assert X.std(axis=0).std() > 43.8
    assert X.std(axis=0).std() > 40
The previous hardcoded threshold of 43.8 seemed to be tailored to X generated with seed 0. I reduced it a bit so the assertion passes for all seeds in the 0-99 range; comparing against the new threshold of 40 still confirms highly varying variances.
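A condensed sketch of the relevant part of the check (mirroring the kind of data used in test_whitening, with the loosened bound):

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n_samples, n_features, rank, n_components = 100, 80, 50, 30

# Low-rank data with correlated features and strongly varying feature scales.
X = rng.randn(n_samples, rank) @ (
    np.diag(np.linspace(10.0, 1.0, rank)) @ rng.randn(rank, n_features)
)
X[:, :50] *= 3

# The per-feature standard deviations vary a lot (loose, seed-robust bound) ...
assert X.std(axis=0).std() > 40

# ... but a whitened PCA transform yields unit-variance components.
pca = PCA(n_components=n_components, whiten=True, svd_solver="full").fit(X)
X_whitened = pca.transform(X)
assert_allclose(X_whitened.std(ddof=1, axis=0), np.ones(n_components), rtol=1e-6)
```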
).fit(X_.copy())
X_unwhitened = pca.transform(X_)
assert X_unwhitened.shape == (n_samples, n_components)

# in that case the output components still have varying variances
assert X_unwhitened.std(axis=0).std() == pytest.approx(74.1, rel=1e-1)
assert X_unwhitened.std(axis=0).std() > 69
Again, the previous comparison was probably tailored to the seed-0 dataset. I lowered the threshold a bit so it passes for all seeds in the 0-99 range. I also decided to use a greater-than comparison instead of approximate equality, since the intent is to confirm varying variances.
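A small sketch of that design choice (same kind of data as above; the exact expected value 74.1 is specific to the seed-0 dataset, while a lower bound is seed-robust):

```python
import numpy as np

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 50) @ (np.diag(np.linspace(10.0, 1.0, 50)) @ rng.randn(50, 80))
X[:, :50] *= 3

pca = PCA(n_components=30, whiten=False, svd_solver="full").fit(X)
X_unwhitened = pca.transform(X)

# pytest.approx(74.1, rel=1e-1) pins a value that only holds for one seed;
# a lower bound expresses the actual intent: without whitening, the
# component-wise variances still vary strongly.
assert X_unwhitened.std(axis=0).std() > 69
```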
iterated_power=7,
n_oversamples=13,
After adjusting the hardcoded assertion thresholds in this test, a few cases with the randomized solver still failed assert_allclose(X_whitened, X_whitened2, rtol=5e-4). Increasing n_oversamples from its default of 10 to 13 solved this without loosening the assertion tolerance.
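Condensed, this is the part of the test affected by the change (the assertion and parameter values mirror the diff; the surrounding setup is simplified):

```python
import numpy as np
from numpy.testing import assert_allclose

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 50) @ (np.diag(np.linspace(10.0, 1.0, 50)) @ rng.randn(50, 80))
X[:, :50] *= 3

pca = PCA(
    n_components=30,
    whiten=True,
    svd_solver="randomized",
    iterated_power=7,
    n_oversamples=13,  # default is 10; extra oversampling tightens the approximation
    random_state=0,
)

# For the randomized solver, fit_transform and transform take different
# numerical paths, so they only agree up to the approximation quality.
X_whitened = pca.fit_transform(X.copy())
X_whitened2 = pca.transform(X)
assert_allclose(X_whitened, X_whitened2, rtol=5e-4)
```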
test_pca_score3 uses global_random_seed fixture
pca.fit(X)
ll2 = pca.score(X)
assert ll1 > ll2


def test_pca_score3():
def test_pca_score3(global_random_seed):
Just including the global_random_seed fixture made the test highly seed-sensitive: for 24 out of 100 seeds, the test score was highest for the ("wrong") model with two components.
I believe the test is supposed to address the following point: a possible strategy for selecting an appropriate number of components is to compare the likelihood of held-out data, e.g. in a cross-validation scheme as illustrated in this example. However, we cannot expect the test likelihood of the "true" model to be maximal for every (isolated) pair of train and test datasets, only on average. So I changed the test to run 10 trials with different train-test pairs for each n_components and accumulate the test scores.
To keep the test fast, I also reduced the sample size: the original sample size of 200 required 15 trials to make the test seed-insensitive in the 0-99 range, whereas with 50 samples only 10 trials are needed.
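A sketch of the accumulation strategy (not necessarily identical to the final test code; the global_random_seed fixture is assumed to come from scikit-learn's test configuration):

```python
import numpy as np

from sklearn.decomposition import PCA


def test_pca_score3_sketch(global_random_seed):
    # Sum held-out log-likelihoods over several train/test pairs instead of
    # relying on a single split, which is only reliable "on average".
    rng = np.random.RandomState(global_random_seed)
    n, p, n_trials = 50, 3, 10
    ll = np.zeros(p)
    for _ in range(n_trials):
        # Rank-1 signal along [3, 4, 5] plus isotropic noise and an offset,
        # so the "true" model has one component.
        Xl = rng.randn(n, p) + rng.randn(n, 1) * np.array([3, 4, 5]) + np.array([1, 0, 7])
        Xt = rng.randn(n, p) + rng.randn(n, 1) * np.array([3, 4, 5]) + np.array([1, 0, 7])
        for k in range(p):
            pca = PCA(n_components=k, svd_solver="full").fit(Xl)
            ll[k] += pca.score(Xt)
    # Accumulated over trials, the held-out likelihood peaks at one component.
    assert ll.argmax() == 1
```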
@marenwestermann, thanks for labelling this. Since I opened this PR, there have been new changes to the affected test file on main; these are now included. I was also wondering whether it would make sense to split this PR up a bit, e.g. one PR for all the trivial changes and individual PRs for the test functions that required adaptations to make them seed-insensitive, as the latter will need more careful review.
Towards #22827
I modified the relevant tests in sklearn/decomposition/tests/test_pca.py to use the global_random_seed fixture. Tests that relate to shapes, dtypes, exceptions, and warnings were not modified.
All modified tests pass locally.
While most tests only required minimal adjustments, some required slightly changing the datasets or the test parametrization. I added comments for the respective code sections below.
The PR only changes a test file, so I assume no changelog entry is required?