
Revert change in sklearn.utils.extmath and fix randomized_svd benchmark #23421


Merged: 5 commits into scikit-learn:main on May 20, 2022

Conversation

lesteve (Member) commented on May 19, 2022:

The main change is to revert the change to sklearn.utils.extmath from #23373.

Close #23418. cc @glemaitre.

Other changes:

  • Only run up to n_iter=5; n_iter=6 was creating infinite values with power_iteration_normalizer=None.
  • Tweak the criterion for when to compute the Frobenius norm by batch. Previously it would try to create a dense matrix of ~9GB (the 20newsgroups dataset is 11314 x 100000 with dtype=float64) and Python would be killed by the OOM killer on my machine with 16GB RAM (a quick arithmetic sketch follows below). Edit: I think there is a copy somewhere, so prior to my change you would need at least 18GB RAM to run the benchmark.

With this I can run the benchmarks on my machine in ~15 minutes.
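A quick, purely illustrative arithmetic check of the memory estimate above (editor's sketch, not part of the benchmark):

# Back-of-the-envelope size of a dense 20newsgroups matrix (11314 x 100000, float64).
n_rows, n_cols, itemsize = 11_314, 100_000, 8  # float64 is 8 bytes per element
dense_gb = n_rows * n_cols * itemsize / 1e9
print(f"{dense_gb:.2f} GB")  # ~9.05 GB; an extra temporary copy pushes the peak towards ~18 GB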

@@ -107,7 +107,7 @@
 # Determine when to switch to batch computation for matrix norms,
 # in case the reconstructed (dense) matrix is too large
-MAX_MEMORY = int(2e9)
+MAX_MEMORY = int(4e9)
lesteve (Member, Author) commented:
Note: despite the name, it was actually a MAX_ELEMENTS before; I now multiply by X.dtype.itemsize so the threshold is expressed in bytes of memory.
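A hedged sketch of what the change means in practice (illustrative names and shapes, not the benchmark code):

import numpy as np

MAX_MEMORY = int(4e9)  # now interpreted as a number of bytes, not a number of elements

X = np.zeros((1_000, 500), dtype=np.float64)  # small array just for illustration
n_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize  # multiply by itemsize to reason in bytes
compute_norm_by_batches = n_bytes >= MAX_MEMORY       # batch only when the dense matrix is too big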

glemaitre (Member) left a comment:

LGTM

thomasjpfan (Member) left a comment:

Thank you for the follow-up PR.

I tested the benchmark locally and it fixes the issue. I left a small comment, otherwise LGTM.

Comment on lines +327 to +329
if not sp.sparse.issparse(X) or (
    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
):
thomasjpfan (Member) commented:
Small nit:

Suggested change
-if not sp.sparse.issparse(X) or (
-    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
-):
+if not sp.sparse.issparse(X) or X.nbytes < MAX_MEMORY:

lesteve (Member, Author) commented May 19, 2022:

I thought the same at one point, but X.nbytes does not work when X is a sparse matrix or X is a pandas DataFrame. Both happen in this benchmark.
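A small sketch of why X.nbytes is not a drop-in replacement here (assuming recent NumPy, SciPy and pandas; the point is attribute availability, not the exact values):

import numpy as np
import pandas as pd
from scipy import sparse

X_dense = np.zeros((3, 4), dtype=np.float64)
X_sparse = sparse.random(3, 4, density=0.5, format="csr")
X_df = pd.DataFrame(X_dense)

print(X_dense.nbytes)               # 96: works for ndarrays
print(hasattr(X_sparse, "nbytes"))  # False: sparse matrices only expose .data.nbytes and friends
print(hasattr(X_df, "nbytes"))      # False: DataFrames report memory via .memory_usage()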

thomasjpfan (Member) commented:

On closer inspection, the not issparse(X) check changes the logic a little. With this PR:

  1. If X is a dataframe -> Call SciPy
  2. If X is a ndarray -> Call SciPy
  3. If X is sparse and X.shape[0]... < MAX_MEMORY -> Call SciPy

I'm guessing we want:

  1. If X is a dataframe -> Call SciPy (The batching code does not work on dataframes)
  2. If X is a ndarray and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  3. If X is a sparse matrix and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  4. Otherwise, batch.
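A minimal sketch of the decision flow described in this list (hedged: the helper name is made up, and it uses X.dtype.itemsize since sparse matrices do not expose .itemsize; the thread below also notes that for sparse matrices X.size counts stored values, so the merged code ends up using X.shape[0] * X.shape[1] instead):

import pandas as pd

def call_scipy_directly(X, max_memory=int(4e9)):
    # Case 1: the batching code does not work on dataframes.
    if isinstance(X, pd.DataFrame):
        return True
    # Cases 2 and 3: ndarray or sparse matrix, compared against the memory budget.
    # Case 4: when this returns False, the caller falls back to batched computation.
    return X.size * X.dtype.itemsize < max_memory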

lesteve (Member, Author) commented May 20, 2022:

If X is a numpy array, then X fits in RAM, so U.dot(np.diag(s).dot(V)) will fit in RAM as well (there is a factor of 2, but with MAX_MEMORY=4e9 this should be OK). You should then be able to compute A = X - U.dot(np.diag(s).dot(V)) and call scipy.linalg.norm(A).

The comment "Call SciPy" is slightly misleading; I think the potential blocker is creating the dense matrix U.dot(np.diag(s).dot(V)) when the matrices are sparse. If that is possible, calling scipy.linalg.norm(A, ord="fro") will not be an issue, I think.
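A hedged sketch of the non-batched path described here (shapes and the factorization are made up for illustration):

import numpy as np
from scipy import linalg, sparse

rng = np.random.default_rng(0)
X = sparse.random(500, 200, density=0.01, format="csr", random_state=0)

# Pretend U, s, V come from a rank-10 truncated SVD of X.
k = 10
U = rng.standard_normal((X.shape[0], k))
s = rng.random(k)
V = rng.standard_normal((k, X.shape[1]))

A = X - U.dot(np.diag(s)).dot(V)  # densifies X: this is the potentially huge intermediate
print(linalg.norm(A, ord="fro"))  # cheap once A fits in memory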

lesteve (Member, Author) commented:

I pushed a commit tweaking the comment and using X.size rather than X.shape[0] * X.shape[1].

@@ -323,8 +323,9 @@ def norm_diff(A, norm=2, msg=True, random_state=None):


 def scalable_frobenius_norm_discrepancy(X, U, s, V):
-    # if the input is not too big, just call scipy
-    if X.shape[0] * X.shape[1] < MAX_MEMORY:
+    if not sp.sparse.issparse(X) or (X.size * X.dtype.itemsize < MAX_MEMORY):
thomasjpfan (Member) commented:

Given your other comment, I think this needs to be X.shape[0] * X.shape[1]; for a sparse matrix, X.size is the number of stored values.

(Also, with X.size I ran into a segfault with a big sparse matrix, 1000000 x 10000.)
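For reference, a tiny illustration of the .size semantics mentioned above (editor's sketch, illustrative shapes only):

import numpy as np
from scipy import sparse

X_dense = np.zeros((1_000, 100))
X_sparse = sparse.random(1_000, 100, density=0.01, format="csr")

print(X_dense.size)                           # 100000: total number of elements
print(X_sparse.size)                          # ~1000: number of *stored* values (nnz)
print(X_sparse.shape[0] * X_sparse.shape[1])  # 100000: what the dense-memory check actually needs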

lesteve (Member, Author) commented:

OK, thanks. I reverted to X.shape[0] * X.shape[1].

@thomasjpfan thomasjpfan merged commit ac84b2f into scikit-learn:main May 20, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
@lesteve lesteve deleted the fix-randomized-svd-benchmark branch March 31, 2023 06:31