
TST refactor test_truncated_svd #14140


Merged: 7 commits into scikit-learn:master on Jun 24, 2019

Conversation

@rth (Member) commented on Jun 21, 2019:

Similarly to #14138, this refactors test_truncated_svd to parametrize the tests and avoid assert_array_almost_equal.

This should help with adding a new solver in #12319.

Should be fairly easy to review.

@rth requested a review from glemaitre on June 21, 2019, 16:49.
@rth (Member, Author) left a comment:

A few explanations:



-def test_algorithms():
+@pytest.mark.parametrize("algorithm", ['randomized'])
@rth (Member, Author):

This would help once we add the lobpcg solver.
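
As a sketch, the parametrization could later grow along these lines; the 'lobpcg' entry is hypothetical until #12319 lands:

import pytest

@pytest.mark.parametrize("algorithm", ['randomized', 'lobpcg'])
def test_algorithms(algorithm):
    # the existing test body runs once per solver;
    # 'lobpcg' is a hypothetical future entry (see #12319)
    ...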

-assert_array_almost_equal(apca.singular_values_, rpca.singular_values_, 12)
+pca = TruncatedSVD(n_components=2, algorithm='randomized',
+                   random_state=rng).fit(X)
+assert_allclose(apca.singular_values_, pca.singular_values_, rtol=1e-2)
@rth (Member, Author):

This fails with a tighter relative tolerance, which I assume means that the singular values are very tiny.

@rth (Member, Author):

No, they are not tiny:

>>> apca.singular_values_
[17.57446934 17.47000919]
>>> pca.singular_values_
[17.52238412 17.3596697 ]

but they are certainly not equal to within a 1e-12 relative tolerance.

This was previously passing because, a few lines above, we were comparing the arpack solver against itself by mistake.
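
A sketch of the previous bug, reconstructed from the discussion (the data and seed here are illustrative, not the test's actual values):

import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(42)
X = rng.uniform(size=(60, 55))

apca = TruncatedSVD(n_components=2, algorithm='arpack',
                    random_state=42).fit(X)
# bug: 'arpack' again instead of the intended 'randomized', so both
# estimators are identical and the comparison is trivially exact
rpca = TruncatedSVD(n_components=2, algorithm='arpack',
                    random_state=42).fit(X)
assert_array_almost_equal(apca.singular_values_, rpca.singular_values_, 12)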

assert_allclose(np.sum(apca.singular_values_**2.0),
                np.linalg.norm(X_apca, "fro")**2.0, rtol=1e-2)
assert_allclose(np.sum(pca.singular_values_**2.0),
                np.linalg.norm(X_pca, "fro")**2.0, rtol=1e-2)
@rth (Member, Author):

Same as above

@glemaitre (Member) left a comment:

Couple of changes

@glemaitre (Member) left a comment:

LGTM

-# Compare sparse vs. dense
-for svd_sparse, svd_dense in svds_sparse_v_dense:
-    assert_array_almost_equal(svd_sparse.explained_variance_ratio_,
-                              svd_dense.explained_variance_ratio_)
@rth (Member, Author):

This was removed because it is indirectly tested below, by checking that explained_variance_ratio_ for both dense and sparse input equals the expected value.

@pytest.fixture(scope='module')
def X_sparse():
    # Make an X that looks somewhat like a small tf-idf matrix.
    # XXX newer versions of SciPy >0.16 have scipy.sparse.rand for this.
Reviewer (Member):

I think our reason for not using scipy.sparse.rand must now be different: we require scipy > 0.16.

@rth (Member, Author):

Good point. I switched to scipy.sparse.rand here.
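
A minimal sketch of the switch; the shape and density below are assumptions, the PR's fixture may differ:

import pytest
import scipy.sparse as sp

@pytest.fixture(scope='module')
def X_sparse():
    # scipy.sparse.rand replaces the hand-rolled tf-idf-like matrix,
    # now that the minimum supported SciPy is above 0.16
    return sp.rand(60, 55, density=0.2, format='csr', random_state=42)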



@pytest.mark.parametrize('fmt', ("array", "csr", "csc", "coo", "lil"))
-def test_sparse_formats(fmt):
-    Xfmt = Xdense if fmt == "dense" else getattr(X, "to" + fmt)()
+def test_sparse_formats(fmt, X_sparse):
Reviewer (Member):

It's interesting that this is not quite covered by common tests. Apparently we check predict and predict_proba, but not transform.

@rth (Member, Author):

Opened #14176 to address this.
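
For context, a sketch of what the parametrized test might look like after this change; the n_components value and the assertion are assumptions, and X_sparse refers to a module-level fixture like the one sketched above:

import pytest
from sklearn.decomposition import TruncatedSVD

@pytest.mark.parametrize('fmt', ("array", "csr", "csc", "coo", "lil"))
def test_sparse_formats(fmt, X_sparse):
    # "to" + "array" resolves to scipy's toarray(), covering the dense
    # case; the other names map to the tocsr/tocsc/tocoo/tolil converters
    Xfmt = getattr(X_sparse, "to" + fmt)()
    tsvd = TruncatedSVD(n_components=11)
    Xtrans = tsvd.fit_transform(Xfmt)
    assert Xtrans.shape == (X_sparse.shape[0], 11)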


@pytest.fixture(scope='module')
def X_sparse():
Reviewer (Member):

I have always wanted to do this more in our tests (turning data into module-level fixtures). Are we okay with this change?

@rth (Member, Author):

I don't see why not.

The only risk is in-place modification of the data. The common tests that check read-only arrays as input should prevent that somewhat. Still, one way around it would be a module- or session-scoped fixture that loads and prepares the data, combined with a function-scoped fixture that hands each test a copy. Another alternative would be to return the data array with its writeable flag set to False.

In any case, that's something we could look into separately. Here it should be fairly equivalent to using a global variable holding the data.
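
A minimal sketch of both guards mentioned above; the fixture names and data are illustrative:

import numpy as np
import pytest

@pytest.fixture(scope='module')
def _X_shared():
    # expensive preparation, done once per module
    rng = np.random.RandomState(0)
    return rng.uniform(size=(60, 55))

@pytest.fixture
def X_copy(_X_shared):
    # function-scoped: each test receives a fresh copy, so in-place
    # modifications cannot leak between tests
    return _X_shared.copy()

@pytest.fixture(scope='module')
def X_readonly(_X_shared):
    # alternative: share one array but mark it read-only, so any
    # accidental write raises immediately
    _X_shared.setflags(write=False)
    return _X_shared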

-assert_array_almost_equal(comp_a[:9], comp_r[:9])
-assert_array_almost_equal(comp_a[9:], comp_r[9:], decimal=2)
+assert_allclose(comp_a[:9], comp[:9], rtol=1e-3)
+assert_allclose(comp_a[9:], comp[9:], rtol=2e-1, atol=1e-2)
Reviewer (Member):

Is rtol needed here?

@rth (Member, Author) commented on Jun 24, 2019:

Yes, the later coefficients need a higher tolerance. Also, decimal in assert_array_almost_equal is an absolute bound rather than a relative one, so a faithful replacement generally needs both rtol and atol, e.g.,

assert_array_almost_equal(0.01, 0, decimal=1)

passes.
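
A quick sketch contrasting the two tolerance models:

from numpy.testing import assert_allclose, assert_array_almost_equal

# decimal=1 means |actual - desired| < 1.5 * 10**-1: an absolute bound,
# so 0.01 vs 0 passes even though the relative error is unbounded
assert_array_almost_equal(0.01, 0, decimal=1)

# rtol alone scales with |desired| and can never match against zero:
# assert_allclose(0.01, 0, rtol=0.1)  # would raise AssertionError

# an explicit atol handles values near zero
assert_allclose(0.01, 0, atol=0.015)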

Reviewer (Member):

In the check for later coefficients:

assert_allclose(comp_a[9:], comp[9:], rtol=2e-1, atol=1e-2)

with atol=1e-2 it feels like rtol can even be set to zero.

BTW I am fine either way, this is kind of nitpicky. 😅

@rth (Member, Author):

On a second look, you are right -- removed it.

@thomasjpfan merged commit eade48e into scikit-learn:master on Jun 24, 2019.
@thomasjpfan (Member):

Thank you @rth!
