Skip to content

[MRG] Add sparse efficiency warning to randomized_svd (fixes randomized_svd is slow for dok_matrix and lil_matrix #11262) #11264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

scottgigante
Copy link
Contributor

@scottgigante scottgigante commented Jun 14, 2018

Reference Issues/PRs

Fixes #11262

What does this implement/fix? Explain your changes.

Adds a scipy.sparse.SparseEfficiencyWarning when randomized_svd is run on lil_matrix or dok_matrix.

Any other comments?

coo_matrix is about 1.5x slower than csr_matrix. Maybe it is worth also warning about this, but that's a judgement call. Also worth considering whether it is better to simply coerce the matrix to a better data type, rather than printing a warning.

lil_matrix is 4x slower than csr_matrix, dok_matrix is ~50x slower.
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @scottgigante !

Minor comment below, otherwise LGTM.

@@ -322,6 +321,15 @@ def randomized_svd(M, n_components, n_oversamples=10, n_iter='auto',
# this implementation is a bit faster with smaller shape[1]
M = M.T

if isinstance(M, sparse.lil_matrix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

if isinstance(M, (sparse.lil_matrix, sparse.dok_matrix)):
    warnings.warn(("Calculating SVD of a %s is expensive. "
                   "csr_matrix is more efficient.")
                   % type(M).__name__))

@scottgigante
Copy link
Contributor Author

Thanks for the review @rth ! Unapologetically used the modern format method introduced in Python 2.6, but otherwise that change is in ;)

@scottgigante scottgigante changed the title Add sparse efficiency warning to randomized_svd (fixes randomized_svd is slow for dok_matrix and lil_matrix #11262) [MRG] Add sparse efficiency warning to randomized_svd (fixes randomized_svd is slow for dok_matrix and lil_matrix #11262) Jun 14, 2018
if isinstance(M, (sparse.lil_matrix, sparse.dok_matrix)):
warnings.warn("Calculating SVD of a {} is expensive. "
"csr_matrix is more efficient.".format(
type(M).__name__))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually adding a unit test under sklearn/utils/tests/test_extmath.py to ensure this warning is raised would be useful. Also in earlier version this raised SparseEfficiencyWarning I didn't mean to suggest to remove that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You caught me right in the middle of writing the test! Also my apologies for dropping off the warning type, that's my poor attention span.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@@ -365,6 +365,21 @@ def test_randomized_svd_power_iteration_normalizer():
assert_greater(15, np.abs(error_2 - error))


def test_randomized_svd_sparse_warnings():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you figure out a way to make the test less time-consuming (e.g., smaller matrix?). Currently, it's among the most time-consuming tests (See Travis log).

@qinhanmin2014
Copy link
Member

Another minor thing (you might ignore and keep current version), will it be better to raise the warning at the beginning of the function? This might makes the code more friendly.

@scottgigante
Copy link
Contributor Author

Thanks for the review, @qinhanmin2014 . I've moved the warning to the start of the function and made the test matrix much smaller.

@rth rth merged commit 56dc374 into scikit-learn:master Jun 17, 2018
@rth
Copy link
Member

rth commented Jun 17, 2018

Thanks @scottgigante !

georgipeev pushed a commit to georgipeev/scikit-learn that referenced this pull request Jun 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

randomized_svd is slow for dok_matrix and lil_matrix
3 participants