TST use global_dtype in sklearn/manifold/tests/test_t_sne.py #22675
Conversation
I think we should not dtype-parametrize too many tests, to avoid growing the number of tests too large and making the test suite too slow to run.
But for the tests we do dtype-parametrize, we should add more assertions to actually check how the model is expected to react to a change of dtype in its outputs and public fitted attributes. In particular, we should check the shape and dtype of the public attribute embedding_, which is not well covered by the existing test suite.
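For illustration, a minimal sketch of what such a dtype-parametrized test with richer assertions could look like (assuming a global_dtype fixture parametrized over float32/float64; the float32 expectation on the output follows from the behavior discussed below):

    import numpy as np
    from sklearn.manifold import TSNE

    def test_tsne_embedding_shape_and_dtype(global_dtype):
        rng = np.random.RandomState(0)
        X = rng.standard_normal((50, 10)).astype(global_dtype)

        tsne = TSNE(n_components=2, init="random", perplexity=10, random_state=0)
        X_embedded = tsne.fit_transform(X)

        # The public fitted attribute should match the requested number
        # of components for every input dtype.
        assert tsne.embedding_.shape == (50, 2)
        assert X_embedded.shape == (50, 2)
        # Per the discussion below, TSNE works in float32 internally, so
        # the embedding comes back as float32 regardless of the input
        # dtype (unless `init` is passed as a float64 array).
        assert tsne.embedding_.dtype == np.float32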
@@ -296,10 +305,12 @@ def test_preserve_trustworthiness_approximately(method, init):
     assert t > 0.85
TSNE is not in the list of #11000, and check_transformer_preserve_dtypes is not called on it because TSNE does not have a transform method (only fit_transform) and therefore is not a "real" transformer.
So I think that we should manually check, in at least one of the tests of this file, that the output of fit_transform preserves the dtype:
Suggested change:

    assert t > 0.85
    assert X_embedded.dtype == dtype
I have checked locally and this should always hold.
Note: I cannot make the suggestion on a new line right after the definition of the X_embedded variable.
Actually this will not work. Apparently our code always works with float32 precision. I think this should be documented in the docstring, as this behavior is quite specific to this estimator.
It does preserve dtypes when we provide init to TSNE as a float64 numpy array, but in the other cases it does not. Everything depends on the initialization of X_embedded:
scikit-learn/sklearn/manifold/_t_sne.py, lines 967 to 993 at ae98150:
    if isinstance(self._init, np.ndarray):
        X_embedded = self._init
    elif self._init == "pca":
        pca = PCA(
            n_components=self.n_components,
            svd_solver="randomized",
            random_state=random_state,
        )
        X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
        # TODO: Update in 1.2
        # PCA is rescaled so that PC1 has standard deviation 1e-4 which is
        # the default value for random initialization. See issue #18018.
        warnings.warn(
            "The PCA initialization in TSNE will change to "
            "have the standard deviation of PC1 equal to 1e-4 "
            "in 1.2. This will ensure better convergence.",
            FutureWarning,
        )
        # X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
    elif self._init == "random":
        # The embedding is initialized with iid samples from Gaussians with
        # standard deviation 1e-4.
        X_embedded = 1e-4 * random_state.standard_normal(
            size=(n_samples, self.n_components)
        ).astype(np.float32)
    else:
        raise ValueError("'init' must be 'pca', 'random', or a numpy array")
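To make the dtype consequence of this dispatch explicit, here is a hypothetical standalone reduction of it (the function name and arguments are illustrative, not scikit-learn API; the PCA branch is elided):

    import numpy as np

    def make_initial_embedding(init, n_samples, n_components, random_state):
        # Hypothetical reduction of the dispatch quoted above.
        if isinstance(init, np.ndarray):
            # The only branch that keeps the caller's dtype: a float64
            # `init` array flows through unchanged.
            return init
        elif init == "random":
            # Explicit float32 cast, regardless of the input dtype.
            return 1e-4 * random_state.standard_normal(
                size=(n_samples, n_components)
            ).astype(np.float32)
        raise ValueError("'init' must be 'pca', 'random', or a numpy array")

    rng = np.random.RandomState(0)
    print(make_initial_embedding(rng.standard_normal((10, 2)), 10, 2, rng).dtype)
    # -> float64
    print(make_initial_embedding("random", 10, 2, rng).dtype)
    # -> float32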
Indeed, we should be documenting this behaviour in the docstring.
Done in the Notes section of the docstring of TSNE.
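For reference, such a note could read roughly like this (hypothetical wording, not necessarily the exact text added by this PR):

    Notes
    -----
    TSNE converts the input data and runs its optimization in float32
    precision, so ``embedding_`` and the output of ``fit_transform`` are
    float32 unless ``init`` is provided as a float64 array.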
sklearn/manifold/tests/test_t_sne.py (Outdated)
@@ -480,11 +496,12 @@ def test_init_not_available():
     tsne.fit_transform(np.array([[0.0], [1.0]]))
Here also, it would be interesting to check the dtype (and shape) of the output.
This just tests that the call raises an error. Why would we need to check the dtype and shape?
I agree with @jjerphan. We can keep the test to be focused on the error checking.
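For context, the test under discussion is roughly of this shape (a sketch; the exact body lives in test_t_sne.py, and the error message here is taken from the init dispatch quoted above):

    import numpy as np
    import pytest
    from sklearn.manifold import TSNE

    def test_init_not_available():
        # Only the raised error matters here; there is no output whose
        # dtype or shape could be checked.
        tsne = TSNE(init="not available", method="exact")
        with pytest.raises(
            ValueError, match="'init' must be 'pca', 'random', or a numpy array"
        ):
            tsne.fit_transform(np.array([[0.0], [1.0]]))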
Actually, maybe it could be the case that the rounding errors of the distance graph computation are not deterministic on macOS? EDIT: I tried to reproduce locally on my macOS machine and could not reproduce it... I reran the test.
test_t_sne.py to test implementations on 32bit datasets
sklearn/manifold/tests/test_t_sne.py
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/manifold/tests/test_t_sne.py (Outdated)
@@ -231,7 +235,7 @@ def test_binary_perplexity_stability():
     assert_array_almost_equal(P1, last_P1, decimal=4)


-def test_gradient():
+def test_gradient(global_dtype):
It seems that _kl_divergence is converting the inputs to 32 bits. Therefore, I don't think that we need to pass global_dtype to this function. We should probably preserve the dtype and revisit these tests.
For now, this test passes. I guess we can rework the test a bit once TSNE supports float64 (this is constrained by the TSNE Cython routines, which only operate on float at the moment). What do you think?
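For readers unfamiliar with the test, test_gradient follows scipy's check_grad pattern: compare the analytic gradient against finite differences of the objective. A self-contained analogue of that pattern, with a simple quadratic standing in for the private _kl_divergence helper:

    import numpy as np
    from scipy.optimize import check_grad

    def fun(params):
        # Stand-in objective for the sketch; test_gradient uses the
        # KL divergence between input and embedding similarities.
        return 0.5 * float(np.sum(params ** 2))

    def grad(params):
        # Analytic gradient of the stand-in objective.
        return params

    rng = np.random.RandomState(0)
    for dtype in (np.float32, np.float64):
        params = rng.standard_normal(100).astype(dtype)
        # check_grad returns the norm of the difference between the
        # analytic gradient and its finite-difference approximation.
        print(dtype, check_grad(fun, grad, params))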
@@ -806,7 +816,7 @@ def test_kl_divergence_not_nan(method):
     assert not np.isnan(tsne.kl_divergence_)


-def test_barnes_hut_angle():
+def test_barnes_hut_angle(global_dtype):
I don't know if we should be adding the dtype here, since some of the components will convert to 32 bits (e.g. _kl_divergence).
Similarly to #22675 (comment): for now, this test passes. I guess we can rework the test a bit once TSNE supports float64 (this is constrained by the TSNE Cython routines, which only operate on float at the moment). What do you think?
A couple more comments.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The parameter validation now tests this.
@ogrisel and @glemaitre: do you think this is mergeable?
After having discussed it IRL, we think TSNE support for both float32 and float64 needs more work first. Hence closing this PR for now, but we might reopen it later.
Reference Issues/PRs
Partially addresses #22881
Precedes #22590
What does this implement/fix? Explain your changes.
This parametrizes tests from test_t_sne.py to run on 32bit datasets.
Any other comments?
We could introduce a mechanism to be able to skip the execution of tests on 32bit datasets if this takes too much time to complete.
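One possible mechanism (a sketch, not what this PR implements; names are illustrative): gate the float32 parameter of the global_dtype fixture behind an environment variable in conftest.py, so the float32 variants only run when explicitly requested:

    # conftest.py sketch (hypothetical names, not this PR's implementation)
    import os

    import numpy as np
    import pytest

    def _global_dtype_params():
        params = [pytest.param(np.float64, id="float64")]
        if os.environ.get("RUN_FLOAT32_TESTS", "0") == "1":
            # Opt into the float32 variants explicitly (e.g. on a
            # dedicated CI job) so the default test suite stays fast.
            params.append(pytest.param(np.float32, id="float32"))
        return params

    @pytest.fixture(params=_global_dtype_params())
    def global_dtype(request):
        return request.param

With this in place, running RUN_FLOAT32_TESTS=1 pytest sklearn/manifold/tests/test_t_sne.py would execute both dtype variants, while a plain pytest run would only execute the float64 ones.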