
Revert change in sklearn.extmath.util and fix randomized_svd benchmark #23421


Merged: 5 commits, May 20, 2022
11 changes: 7 additions & 4 deletions benchmarks/bench_plot_randomized_svd.py
@@ -107,7 +107,7 @@

# Determine when to switch to batch computation for matrix norms,
# in case the reconstructed (dense) matrix is too large
-MAX_MEMORY = int(2e9)
+MAX_MEMORY = int(4e9)
Member Author (lesteve) commented:
Note: despite the name, this was actually a MAX_ELEMENTS threshold before; I now multiply by X.dtype.itemsize so that it is expressed in terms of memory (bytes).
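
For illustration, a small sketch of the threshold arithmetic with a made-up shape (not taken from the benchmark datasets):

    import numpy as np

    MAX_MEMORY = int(4e9)  # threshold is now expressed in bytes

    shape = (50_000, 10_000)                 # hypothetical matrix shape
    dtype = np.dtype(np.float64)

    n_elements = shape[0] * shape[1]         # 500 million elements
    n_bytes = n_elements * dtype.itemsize    # 4e9 bytes for float64

    print(n_bytes < MAX_MEMORY)              # False: at the limit, so the batched path is used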


# The following datasets can be downloaded manually from:
# CIFAR 10: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
@@ -323,8 +323,11 @@ def norm_diff(A, norm=2, msg=True, random_state=None):


def scalable_frobenius_norm_discrepancy(X, U, s, V):
-    # if the input is not too big, just call scipy
-    if X.shape[0] * X.shape[1] < MAX_MEMORY:
+    if not sp.sparse.issparse(X) or (
+        X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
+    ):
Member commented on lines +326 to +328:
Small nit, suggested change:

-    if not sp.sparse.issparse(X) or (
-        X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
-    ):
+    if not sp.sparse.issparse(X) or X.nbytes < MAX_MEMORY:

Member Author (lesteve) replied, May 19, 2022:
I thought the same at one point, but X.nbytes does not work when X is a sparse matrix or X is a pandas DataFrame. Both happen in this benchmark.
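
For reference, a quick check (with toy data, not from the benchmark) of why X.nbytes is not a drop-in replacement here:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp

    X_dense = np.zeros((100, 5), dtype=np.float64)
    X_sparse = sp.random(100, 5, density=0.1, format="csr")
    X_df = pd.DataFrame(X_dense)

    print(X_dense.nbytes)   # 4000: works for ndarrays
    # X_sparse.nbytes       # AttributeError: scipy sparse matrices have no 'nbytes'
    # X_df.nbytes           # AttributeError: DataFrames expose memory_usage() instead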

Member replied:
On closer inspection, the not issparse(X) check changes the logic a little. With this PR:

  1. If X is a dataframe -> Call SciPy
  2. If X is a ndarray -> Call SciPy
  3. If X is sparse and X.shape[0]... < MAX_MEMORY -> Call SciPy

I'm guessing we want the following (sketched in code just after this list):

  1. If X is a dataframe -> Call SciPy (The batching code does not work on dataframes)
  2. If X is a ndarray and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  3. If X is a sparse matrix and X.size * X.itemsize < MAX_MEMORY -> Call SciPy directly
  4. Otherwise, batch.
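
A hypothetical sketch of that decision logic (my own illustration, not code from the PR; the helper name is made up):

    import pandas as pd

    MAX_MEMORY = int(4e9)

    def call_scipy_directly(X):
        # DataFrames: the batching code does not handle them, so always go direct.
        if isinstance(X, pd.DataFrame):
            return True
        # ndarrays and sparse matrices: go direct only if the dense
        # reconstruction would stay under the memory budget.
        return X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY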

Member Author (lesteve) replied, May 20, 2022:
If X is a numpy array, then X fits in RAM, so U.dot(np.diag(s).dot(V)) will fit in RAM as well (there is a factor of 2, but with MAX_MEMORY=4e9 this should be OK). You should then be able to compute A = X - U.dot(np.diag(s).dot(V)) and then call scipy.linalg.norm(A).

The comment "Call Scipy" is slightly misleading; I think the potential blocker is creating the dense matrix U.dot(np.diag(s).dot(V)) when the matrices are sparse. If that is possible, calling scipy.linalg.norm(A, norm="fro") will not be an issue, I think.

Member Author (lesteve) replied:
I pushed a commit tweaking the comment and using X.size rather than X.shape[0] * X.shape[1]

+        # if the input is not sparse or sparse but not too big,
+        # U.dot(np.diag(s).dot(V)) will fit in RAM
        A = X - U.dot(np.diag(s).dot(V))
        return norm_diff(A, norm="fro")

@@ -498,7 +501,7 @@ def bench_c(datasets, n_comps):
if __name__ == "__main__":
    random_state = check_random_state(1234)

-    power_iter = np.linspace(0, 6, 7, dtype=int)
+    power_iter = np.arange(0, 6)
    n_comps = 50

    for dataset_name in datasets:
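
As a side note, a quick check (mine, not part of the PR) of what the two power_iter expressions evaluate to:

    import numpy as np

    print(np.linspace(0, 6, 7, dtype=int))  # [0 1 2 3 4 5 6] -> 7 values, includes 6
    print(np.arange(0, 6))                  # [0 1 2 3 4 5]   -> 6 values, stops before 6
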
7 changes: 3 additions & 4 deletions sklearn/utils/extmath.py
@@ -216,6 +216,9 @@ def randomized_range_finder(

    # Generating normal random vectors with shape: (A.shape[1], size)
    Q = random_state.normal(size=(A.shape[1], size))
+    if hasattr(A, "dtype") and A.dtype.kind == "f":
+        # Ensure f32 is preserved as f32
+        Q = Q.astype(A.dtype, copy=False)

    # Deal with "auto" mode
    if power_iteration_normalizer == "auto":

@@ -241,10 +244,6 @@
    # Extract an orthonormal basis
    Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode="economic")

-    if hasattr(A, "dtype") and A.dtype.kind == "f":
-        # Ensure f32 is preserved as f32
-        Q = Q.astype(A.dtype, copy=False)

    return Q


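For context, a minimal sketch (with made-up shapes, and a plain @ product instead of safe_sparse_dot) of the dtype behaviour involved: random_state.normal always returns float64, so casting Q right after drawing it keeps the subsequent products and QR factorizations in the input's float32 dtype rather than only casting the final result.

    import numpy as np
    from scipy import linalg

    rng = np.random.RandomState(0)
    A = rng.rand(100, 20).astype(np.float32)   # hypothetical float32 input
    size = 10

    Q = rng.normal(size=(A.shape[1], size))    # float64, regardless of A's dtype
    if hasattr(A, "dtype") and A.dtype.kind == "f":
        Q = Q.astype(A.dtype, copy=False)      # cast up front, as in the patched code

    Q, _ = linalg.qr(A @ Q, mode="economic")   # stays float32 end to end
    print(Q.dtype)                             # float32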