TST use global_dtype in sklearn/manifold/tests/test_t_sne.py #22675
Conversation
I think we should not dtype-parametrize too many tests, to avoid growing the number of tests too large and making the test suite too slow to run.
But for the tests we do dtype-parametrize, we should add more assertions to actually check how the model is expected to react to a change of dtype in its outputs and public fitted attributes. In particular, we should check the shape and dtype of the public attribute embedding_, which is not well covered by the existing test suite.
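For illustration, a minimal sketch of what such a dtype-parametrized test with richer assertions could look like (assuming a global_dtype fixture parametrized over float32/float64; the float32 expectation on the output follows from the behavior discussed below):

    import numpy as np
    from sklearn.manifold import TSNE

    def test_tsne_embedding_shape_and_dtype(global_dtype):
        rng = np.random.RandomState(0)
        X = rng.standard_normal((50, 10)).astype(global_dtype)

        tsne = TSNE(n_components=2, init="random", perplexity=10, random_state=0)
        X_embedded = tsne.fit_transform(X)

        # The public fitted attribute should match the requested number
        # of components for every input dtype.
        assert tsne.embedding_.shape == (50, 2)
        assert X_embedded.shape == (50, 2)
        # Per the discussion below, TSNE works in float32 internally, so
        # the embedding comes back as float32 regardless of the input
        # dtype (unless `init` is passed as a float64 array).
        assert tsne.embedding_.dtype == np.float32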
@@ -296,10 +305,12 @@ def test_preserve_trustworthiness_approximately(method, init):
     assert t > 0.85
TSNE is not in the list of #11000, and check_transformer_preserve_dtypes is not called on it because TSNE does not have a transform method (only fit_transform) and therefore is not a "real" transformer.
So I think that we should manually check, in at least one of the tests of this file, that the output of fit_transform preserves the dtype:
Suggested change:

    assert t > 0.85
    assert X_embedded.dtype == dtype
I have checked locally and this should always hold.
Note: I cannot make the suggestion on a new line right after the definition of the X_embedded variable.
Actually this will not work. Apparently our code always works with float32 precision. I think this should be documented in the docstring, as this behavior is quite specific to this estimator.
It does preserve dtypes when we provide init to TSNE as a float64 numpy array, but in the other cases it does not. Everything depends on the initialization of X_embedded:
scikit-learn/sklearn/manifold/_t_sne.py, lines 967 to 993 at ae98150:
    if isinstance(self._init, np.ndarray):
        X_embedded = self._init
    elif self._init == "pca":
        pca = PCA(
            n_components=self.n_components,
            svd_solver="randomized",
            random_state=random_state,
        )
        X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
        # TODO: Update in 1.2
        # PCA is rescaled so that PC1 has standard deviation 1e-4 which is
        # the default value for random initialization. See issue #18018.
        warnings.warn(
            "The PCA initialization in TSNE will change to "
            "have the standard deviation of PC1 equal to 1e-4 "
            "in 1.2. This will ensure better convergence.",
            FutureWarning,
        )
        # X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
    elif self._init == "random":
        # The embedding is initialized with iid samples from Gaussians with
        # standard deviation 1e-4.
        X_embedded = 1e-4 * random_state.standard_normal(
            size=(n_samples, self.n_components)
        ).astype(np.float32)
    else:
        raise ValueError("'init' must be 'pca', 'random', or a numpy array")
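To make the dtype consequence of this dispatch explicit, here is a hypothetical standalone reduction of it (the function name and arguments are illustrative, not scikit-learn API; the PCA branch is elided):

    import numpy as np

    def make_initial_embedding(init, n_samples, n_components, random_state):
        # Hypothetical reduction of the dispatch quoted above.
        if isinstance(init, np.ndarray):
            # The only branch that keeps the caller's dtype: a float64
            # `init` array flows through unchanged.
            return init
        elif init == "random":
            # Explicit float32 cast, regardless of the input dtype.
            return 1e-4 * random_state.standard_normal(
                size=(n_samples, n_components)
            ).astype(np.float32)
        raise ValueError("'init' must be 'pca', 'random', or a numpy array")

    rng = np.random.RandomState(0)
    print(make_initial_embedding(rng.standard_normal((10, 2)), 10, 2, rng).dtype)
    # -> float64
    print(make_initial_embedding("random", 10, 2, rng).dtype)
    # -> float32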
Indeed, we should be documenting this behaviour in the docstring.
Done in the Notes section of the docstring of TSNE.
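For reference, such a note could read roughly like this (hypothetical wording, not necessarily the exact text added by this PR):

    Notes
    -----
    TSNE converts the input data and runs its optimization in float32
    precision, so ``embedding_`` and the output of ``fit_transform`` are
    float32 unless ``init`` is provided as a float64 array.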
sklearn/manifold/tests/test_t_sne.py (Outdated)
@@ -480,11 +496,12 @@ def test_init_not_available():
     tsne.fit_transform(np.array([[0.0], [1.0]]))
Here also, it would be interesting to check the dtype (and shape) of the output.
This just tests that the call raises an error. Why would we need to check the dtype and shape?
I agree with @jjerphan. We can keep the test to be focused on the error checking.
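For context, the test under discussion is roughly of this shape (a sketch; the exact body lives in test_t_sne.py, and the error message here is taken from the init dispatch quoted above):

    import numpy as np
    import pytest
    from sklearn.manifold import TSNE

    def test_init_not_available():
        # Only the raised error matters here; there is no output whose
        # dtype or shape could be checked.
        tsne = TSNE(init="not available", method="exact")
        with pytest.raises(
            ValueError, match="'init' must be 'pca', 'random', or a numpy array"
        ):
            tsne.fit_transform(np.array([[0.0], [1.0]]))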
Actually, maybe it could be the case that the rounding errors of the distance graph computation are not deterministic on macOS? EDIT: I tried to reproduce locally on my macOS machine and could not reproduce it... I reran the test.
test_t_sne.py to test implementations on 32bit datasets
sklearn/manifold/tests/test_t_sne.py
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/manifold/tests/test_t_sne.py (Outdated)
@@ -231,7 +235,7 @@ def test_binary_perplexity_stability():
     assert_array_almost_equal(P1, last_P1, decimal=4)


-def test_gradient():
+def test_gradient(global_dtype):
It seems that _kl_divergence is converting the inputs to 32 bits. Therefore, I don't think that we need to pass global_dtype to this function. We should probably preserve the dtype and revisit these tests.
For now, this test passes. I guess we can rework the test a bit once TSNE supports float64 (this is constrained by the TSNE Cython routines, which only operate on float at the moment). What do you think?
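For readers unfamiliar with the test, test_gradient follows scipy's check_grad pattern: compare the analytic gradient against finite differences of the objective. A self-contained analogue of that pattern, with a simple quadratic standing in for the private _kl_divergence helper:

    import numpy as np
    from scipy.optimize import check_grad

    def fun(params):
        # Stand-in objective for the sketch; test_gradient uses the
        # KL divergence between input and embedding similarities.
        return 0.5 * float(np.sum(params ** 2))

    def grad(params):
        # Analytic gradient of the stand-in objective.
        return params

    rng = np.random.RandomState(0)
    for dtype in (np.float32, np.float64):
        params = rng.standard_normal(100).astype(dtype)
        # check_grad returns the norm of the difference between the
        # analytic gradient and its finite-difference approximation.
        print(dtype, check_grad(fun, grad, params))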
@@ -806,7 +816,7 @@ def test_kl_divergence_not_nan(method):
     assert not np.isnan(tsne.kl_divergence_)


-def test_barnes_hut_angle():
+def test_barnes_hut_angle(global_dtype):
I don't know if we should be adding the dtype here, since some of the components will convert to 32 bits (e.g. _kl_divergence).
Similarly to #22675 (comment): for now, this test passes. I guess we can rework the test a bit once TSNE supports float64 (this is constrained by the TSNE Cython routines, which only operate on float at the moment). What do you think?
A couple more comments.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The parameter validation now tests this.
@ogrisel and @glemaitre: do you think this is mergeable?
After having discussed it IRL, we think TSNE support for both float32 and float64 needs more work first. Hence closing this PR for now, but we might reopen it later.
Reference Issues/PRs
Partially addresses #22881
Precedes #22590
What does this implement/fix? Explain your changes.
This parametrizes tests from test_t_sne.py to run on 32bit datasets.
Any other comments?
We could introduce a mechanism to be able to skip the execution of tests on 32bit datasets if this takes too much time to complete.
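One possible mechanism (a sketch, not what this PR implements; names are illustrative): gate the float32 parameter of the global_dtype fixture behind an environment variable in conftest.py, so the float32 variants only run when explicitly requested:

    # conftest.py sketch (hypothetical names, not this PR's implementation)
    import os

    import numpy as np
    import pytest

    def _global_dtype_params():
        params = [pytest.param(np.float64, id="float64")]
        if os.environ.get("RUN_FLOAT32_TESTS", "0") == "1":
            # Opt into the float32 variants explicitly (e.g. on a
            # dedicated CI job) so the default test suite stays fast.
            params.append(pytest.param(np.float32, id="float32"))
        return params

    @pytest.fixture(params=_global_dtype_params())
    def global_dtype(request):
        return request.param

With this in place, running RUN_FLOAT32_TESTS=1 pytest sklearn/manifold/tests/test_t_sne.py would execute both dtype variants, while a plain pytest run would only execute the float64 ones.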