
Revert change in sklearn.utils.extmath and fix randomized_svd benchmark #23421


Merged: 5 commits into scikit-learn:main on May 20, 2022

Conversation

lesteve (Member) commented on May 19, 2022:

The main change is to revert the change to sklearn.utils.extmath from #23373.

Close #23418. cc @glemaitre.

Other changes:

  • Only run up to n_iter=5; n_iter=6 was creating infinite values with power_iteration_normalizer=None.
  • Tweak the criterion for when to compute the Frobenius norm by batch. Previously it would try to create a dense matrix of ~9GB (the 20newsgroups dataset is 11314 x 100000 with dtype=float64) and Python would be killed by the OOM killer on my machine with 16GB RAM (a quick arithmetic sketch follows below). Edit: I think there is a copy somewhere, so prior to my change you would need at least 18GB RAM to run the benchmark.

With this I can run the benchmarks on my machine in ~15 minutes.
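A quick, purely illustrative arithmetic check of the memory estimate above (editor's sketch, not part of the benchmark):

# Back-of-the-envelope size of a dense 20newsgroups matrix (11314 x 100000, float64).
n_rows, n_cols, itemsize = 11_314, 100_000, 8  # float64 is 8 bytes per element
dense_gb = n_rows * n_cols * itemsize / 1e9
print(f"{dense_gb:.2f} GB")  # ~9.05 GB; an extra temporary copy pushes the peak towards ~18 GB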

@@ -107,7 +107,7 @@
 # Determine when to switch to batch computation for matrix norms,
 # in case the reconstructed (dense) matrix is too large
-MAX_MEMORY = int(2e9)
+MAX_MEMORY = int(4e9)
lesteve (Member, Author) commented:
Note: despite the name, it was actually a MAX_ELEMENTS before; I now multiply by X.dtype.itemsize so the threshold is expressed in bytes of memory.
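A hedged sketch of what the change means in practice (illustrative names and shapes, not the benchmark code):

import numpy as np

MAX_MEMORY = int(4e9)  # now interpreted as a number of bytes, not a number of elements

X = np.zeros((1_000, 500), dtype=np.float64)  # small array just for illustration
n_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize  # multiply by itemsize to reason in bytes
compute_norm_by_batches = n_bytes >= MAX_MEMORY       # batch only when the dense matrix is too big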

glemaitre (Member) left a comment:

LGTM

thomasjpfan (Member) left a comment:

Thank you for the follow-up PR.

I tested the benchmark locally and it fixes the issue. I left a small comment, otherwise LGTM.

Comment on lines +327 to +329
if not sp.sparse.issparse(X) or (
    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
):
thomasjpfan (Member) commented:
Small nit:

Suggested change
-if not sp.sparse.issparse(X) or (
-    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
-):
+if not sp.sparse.issparse(X) or X.nbytes < MAX_MEMORY:

lesteve (Member, Author) commented May 19, 2022:

I thought the same at one point, but X.nbytes does not work when X is a sparse matrix or X is a pandas DataFrame. Both happen in this benchmark.
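A small sketch of why X.nbytes is not a drop-in replacement here (assuming recent NumPy, SciPy and pandas; the point is attribute availability, not the exact values):

import numpy as np
import pandas as pd
from scipy import sparse

X_dense = np.zeros((3, 4), dtype=np.float64)
X_sparse = sparse.random(3, 4, density=0.5, format="csr")
X_df = pd.DataFrame(X_dense)

print(X_dense.nbytes)               # 96: works for ndarrays
print(hasattr(X_sparse, "nbytes"))  # False: sparse matrices only expose .data.nbytes and friends
print(hasattr(X_df, "nbytes"))      # False: DataFrames report memory via .memory_usage()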

thomasjpfan (Member) commented:

On closer inspection, the not issparse(X) check changes the logic a little. With this PR:

  1. If X is a dataframe -> Call SciPy
  2. If X is a ndarray -> Call SciPy
  3. If X is sparse and X.shape[0]... < MAX_MEMORY -> Call SciPy

I'm guessing we want:

  1. If X is a dataframe -> Call SciPy (The batching code does not work on dataframes)
  2. If X is a ndarray and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  3. If X is a sparse matrix and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  4. Otherwise, batch.
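A minimal sketch of the decision flow described in this list (hedged: the helper name is made up, and it uses X.dtype.itemsize since sparse matrices do not expose .itemsize; the thread below also notes that for sparse matrices X.size counts stored values, so the merged code ends up using X.shape[0] * X.shape[1] instead):

import pandas as pd

def call_scipy_directly(X, max_memory=int(4e9)):
    # Case 1: the batching code does not work on dataframes.
    if isinstance(X, pd.DataFrame):
        return True
    # Cases 2 and 3: ndarray or sparse matrix, compared against the memory budget.
    # Case 4: when this returns False, the caller falls back to batched computation.
    return X.size * X.dtype.itemsize < max_memory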

lesteve (Member, Author) commented May 20, 2022:

If X is a numpy array, then X fits in RAM, so U.dot(np.diag(s).dot(V)) will fit in RAM as well (there is a factor of 2, but with MAX_MEMORY=4e9 this should be OK). You should then be able to compute A = X - U.dot(np.diag(s).dot(V)) and call scipy.linalg.norm(A).

The comment "Call SciPy" is slightly misleading; I think the potential blocker is creating the dense matrix U.dot(np.diag(s).dot(V)) when the matrices are sparse. If that is possible, calling scipy.linalg.norm(A, ord="fro") will not be an issue, I think.
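A hedged sketch of the non-batched path described here (shapes and the factorization are made up for illustration):

import numpy as np
from scipy import linalg, sparse

rng = np.random.default_rng(0)
X = sparse.random(500, 200, density=0.01, format="csr", random_state=0)

# Pretend U, s, V come from a rank-10 truncated SVD of X.
k = 10
U = rng.standard_normal((X.shape[0], k))
s = rng.random(k)
V = rng.standard_normal((k, X.shape[1]))

A = X - U.dot(np.diag(s)).dot(V)  # densifies X: this is the potentially huge intermediate
print(linalg.norm(A, ord="fro"))  # cheap once A fits in memory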

lesteve (Member, Author) commented:

I pushed a commit tweaking the comment and using X.size rather than X.shape[0] * X.shape[1].

@@ -323,8 +323,9 @@ def norm_diff(A, norm=2, msg=True, random_state=None):


 def scalable_frobenius_norm_discrepancy(X, U, s, V):
-    # if the input is not too big, just call scipy
-    if X.shape[0] * X.shape[1] < MAX_MEMORY:
+    if not sp.sparse.issparse(X) or (X.size * X.dtype.itemsize < MAX_MEMORY):
thomasjpfan (Member) commented:

Given your other comment, I think this needs to be X.shape[0] * X.shape[1]; for a sparse matrix, X.size is the number of stored values.

(Also, with X.size I ran into a segfault with a big sparse matrix, 1000000 x 10000.)
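For reference, a tiny illustration of the .size semantics mentioned above (editor's sketch, illustrative shapes only):

import numpy as np
from scipy import sparse

X_dense = np.zeros((1_000, 100))
X_sparse = sparse.random(1_000, 100, density=0.01, format="csr")

print(X_dense.size)                           # 100000: total number of elements
print(X_sparse.size)                          # ~1000: number of *stored* values (nnz)
print(X_sparse.shape[0] * X_sparse.shape[1])  # 100000: what the dense-memory check actually needs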

lesteve (Member, Author) commented:

OK, thanks. I reverted to X.shape[0] * X.shape[1].

@thomasjpfan thomasjpfan merged commit ac84b2f into scikit-learn:main May 20, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
@lesteve lesteve deleted the fix-randomized-svd-benchmark branch March 31, 2023 06:31