TST use global_random_seed in sklearn/decomposition/tests/test_sparse_pca.py (#31213)
Conversation
Tests updated to use `global_random_seed`:
- `test_fit_transform`
- `test_fit_transform_parallel`
- `test_fit_transform_tall`
- `test_initialization`
- `test_scaling_fit_transform`
- `test_pca_vs_spca`
- `test_sparse_pca_inverse_transform`
- `test_transform_inverse_transform_round_trip`

Testing with all the seed values passed locally.
LGTM!
(no idea about test_mini_batch_fit_transform though ...)
I think you can add `global_random_seed` to:
- `test_transform_nan`
- `test_sparse_pca_numerical_consistency`
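For reference, a minimal sketch of the pattern (the data generation and assertion below are illustrative placeholders, not the actual test body): `global_random_seed` is the pytest fixture scikit-learn uses to parametrize a test over several integer seeds.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def test_transform_nan(global_random_seed):
    # The fixture supplies an integer seed, so the test is exercised
    # under several seeds in CI rather than one hard-coded value.
    rng = np.random.RandomState(global_random_seed)
    Y = rng.randn(10, 8)
    Y[:, 0] = 0  # a constant feature: the NaN-prone case this test targets
    estimator = SparsePCA(n_components=4, random_state=global_random_seed)
    assert not np.isnan(estimator.fit_transform(Y)).any()
```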
```python
spca = SparsePCA(
    alpha=0, ridge_alpha=0, n_components=2, random_state=global_random_seed
)
pca = PCA(n_components=2, random_state=global_random_seed)
```
It's not required to start from the same seed in the test, is it?
Suggested change:

```diff
-spca = SparsePCA(
-    alpha=0, ridge_alpha=0, n_components=2, random_state=global_random_seed
-)
-pca = PCA(n_components=2, random_state=global_random_seed)
+spca = SparsePCA(
+    alpha=0, ridge_alpha=0, n_components=2, random_state=rng
+)
+pca = PCA(n_components=2, random_state=rng)
```
Thank you.

I need to ask (maybe @glemaitre?) why it's sometimes fine to use `rng` (from `np.random.RandomState(global_random_seed)`) and sometimes the seed `global_random_seed` directly.
`global_random_seed` is a number like 0, 1, 42, ...

`np.random.RandomState` is an object with a state. When you ask it to generate a random value, its state changes, so the next time you ask it, it gives a different value.

If you pass an integer as random state to an estimator, it's used to create and seed a `RandomState` object. Therefore, each time the estimator is fitted, it starts from the same seed and produces the same results.

So to answer your question, it depends on whether you need to generate the exact same sequence of values multiple times or not. For instance, if you want to compare the results of an estimator between float32 and float64 input, you want both estimators to use the same sequence of generated values.
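A minimal sketch of that difference, using `check_random_state`, the helper scikit-learn applies to `random_state` parameters:

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer is used to create and seed a fresh RandomState each time,
# so the same sequence is reproduced on every call.
assert np.allclose(check_random_state(42).rand(3), check_random_state(42).rand(3))

# A RandomState instance is returned as-is; each draw advances its state,
# so successive draws give different values.
rng = np.random.RandomState(42)
first = check_random_state(rng).rand(3)
second = check_random_state(rng).rand(3)
assert not np.allclose(first, second)
```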
Thank you very much @jeremiedbb... I see that my question wasn't clear.

In the file modified with this PR, some tests set the SparsePCA parameter `random_state` as `random_state=rng`, where `rng = np.random.RandomState(global_random_seed)`. For example `test_fit_transform_tall`. But some tests set the `random_state` parameter as `random_state=global_random_seed`. Like `test_sparse_pca_numerical_consistency`.

I guess it doesn't matter, as there is this `check_random_state(seed)` helper that turns the seed into a `np.random.RandomState` instance. In the end they might both be the same, except that changing the `random_state` of `test_sparse_pca_numerical_consistency` to `random_state=rng` was failing.

Anyway, your previous explanation confirmed my understanding. :-)
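To illustrate why that test fails with a shared instance (a toy sketch, with made-up data shapes and estimator settings rather than the ones from the actual test): the second fit consumes a later part of the shared random stream than the first, whereas an integer seed re-seeds each fit identically.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

X64 = np.random.RandomState(0).randn(20, 5)
X32 = X64.astype(np.float32)

# Shared instance: the state advances between the two fits, so they
# do not see the same sequence of random values.
rng = np.random.RandomState(0)
out_a = SparsePCA(n_components=2, random_state=rng).fit_transform(X64)
out_b = SparsePCA(n_components=2, random_state=rng).fit_transform(X32)

# Integer seed: each fit starts from an identically seeded RandomState,
# which is what a float32/float64 consistency check needs.
out_c = SparsePCA(n_components=2, random_state=0).fit_transform(X64)
out_d = SparsePCA(n_components=2, random_state=0).fit_transform(X32)
```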
Let's just remove it. It's useless to keep dead code in the code base.

I added the seed to those.
Updated test list:
- `test_fit_transform`
- `test_fit_transform_parallel`
- `test_transform_nan`
- `test_fit_transform_tall`
- `test_initialization`
- `test_scaling_fit_transform`
- `test_pca_vs_spca`
- `test_sparse_pca_numerical_consistency`
- `test_sparse_pca_inverse_transform`
- `test_transform_inverse_transform_round_trip`
Looking into it a bit, it seems to be an issue with the convergence criterion of DictionaryLearning (which is used by SparsePCA). It's not protected against rounding errors leading to negative values, so it may happen that in one case the algorithm keeps running for a few more iterations than in the other. It's not a major issue and not really related to this PR, so I just switched the dataset for an easier one for SparsePCA.
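For illustration only, a generic sketch of the kind of guard that avoids this; it is not DictionaryLearning's actual stopping code:

```python
def converged(cost_prev, cost, tol=1e-8):
    # Near the optimum, rounding can make this tiny decrease negative.
    decrease = cost_prev - cost
    # Comparing the magnitude against a scaled tolerance keeps the
    # stopping rule well behaved in the presence of rounding noise.
    return abs(decrease) <= tol * max(abs(cost), 1.0)
```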
LGTM. Thanks
Reference Issues/PRs
Towards #22827
What does this implement/fix? Explain your changes.
Any other comments?
I wonder why `test_mini_batch_fit_transform` is kept. It is skipped all the time. It comes from PR #12253. cc @glemaitre