Revert change in sklearn.extmath.util and fix randomized_svd benchmark #23421
Conversation
```diff
@@ -107,7 +107,7 @@

 # Determine when to switch to batch computation for matrix norms,
 # in case the reconstructed (dense) matrix is too large
-MAX_MEMORY = int(2e9)
+MAX_MEMORY = int(4e9)
```
Note: despite the name, it was actually a `MAX_ELEMENTS` before; I now multiply by `X.dtype.itemsize` so the threshold is expressed in bytes of memory.
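For illustration (a minimal sketch, not the benchmark code itself), the difference between counting elements and counting bytes for a dense array:

```python
import numpy as np

# A float64 element occupies 8 bytes, so an element count understates
# memory use by the itemsize factor.
X = np.zeros((1000, 500), dtype=np.float64)
n_elements = X.shape[0] * X.shape[1]      # 500_000 elements
n_bytes = n_elements * X.dtype.itemsize   # 4_000_000 bytes for float64
assert n_bytes == X.nbytes
```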
LGTM
Thank you for the follow-up PR.
I tested the benchmark locally and it fixes the issue. I left a small comment, otherwise LGTM.
```python
if not sp.sparse.issparse(X) or (
    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
):
```
Small nit:

```diff
-if not sp.sparse.issparse(X) or (
-    X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY
-):
+if not sp.sparse.issparse(X) or X.nbytes < MAX_MEMORY:
```
I thought the same at one point, but `X.nbytes` does not work when `X` is a sparse matrix or a pandas DataFrame. Both happen in this benchmark.
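A quick sketch of why (assuming current NumPy, SciPy, and pandas behavior): only the ndarray exposes an `nbytes` attribute.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

dense = np.zeros((5, 5))
sparse = sp.csr_matrix(dense)
frame = pd.DataFrame(dense)

# The ndarray exposes nbytes, but neither the sparse matrix nor the
# DataFrame does, so a shape-based estimate is needed for those inputs.
has_nbytes = {
    "ndarray": hasattr(dense, "nbytes"),
    "csr_matrix": hasattr(sparse, "nbytes"),
    "DataFrame": hasattr(frame, "nbytes"),
}
```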
On closer inspection, the `not issparse(X)` changes the logic a little. With this PR:

- If `X` is a dataframe -> Call SciPy
- If `X` is a ndarray -> Call SciPy
- If `X` is sparse and `X.shape[0] * X.shape[1] * X.dtype.itemsize < MAX_MEMORY` -> Call SciPy

I'm guessing we want:

- If `X` is a dataframe -> Call SciPy (the batching code does not work on dataframes)
- If `X` is a ndarray and `X.size * X.itemsize < MAX_MEMORY` -> Call SciPy directly
- If `X` is a sparse matrix and `X.size * X.itemsize < MAX_MEMORY` -> Call SciPy directly
- Otherwise, batch.
If `X` is a numpy array, then `X` fits in RAM, so `U.dot(np.diag(s).dot(V))` will fit in RAM as well (there is a factor of 2, but with `MAX_MEMORY=4e9` this should be OK). You should then be able to compute `A = X - U.dot(np.diag(s).dot(V))` and then call `scipy.linalg.norm(A)`.

The comment "Call SciPy" is slightly misleading; I think the potential blocker is creating the dense matrix `U.dot(np.diag(s).dot(V))` when the matrices are sparse. If that is possible, calling `scipy.linalg.norm(A, ord="fro")` will not be an issue, I think.
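A minimal illustration of the computation being discussed, on a small dense array (for an exact thin SVD the discrepancy is essentially zero; in the benchmark the SVD is approximate, so the norm measures reconstruction error):

```python
import numpy as np
from scipy.linalg import norm, svd

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))

# Exact thin SVD, then the dense reconstruction and its Frobenius-norm
# discrepancy from the original matrix.
U, s, V = svd(X, full_matrices=False)
A = X - U.dot(np.diag(s).dot(V))
frobenius_discrepancy = norm(A, ord="fro")
```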
I pushed a commit tweaking the comment and using `X.size` rather than `X.shape[0] * X.shape[1]`.
```diff
@@ -323,8 +323,9 @@ def norm_diff(A, norm=2, msg=True, random_state=None):


 def scalable_frobenius_norm_discrepancy(X, U, s, V):
     # if the input is not too big, just call scipy
-    if X.shape[0] * X.shape[1] < MAX_MEMORY:
+    if not sp.sparse.issparse(X) or (X.size * X.dtype.itemsize < MAX_MEMORY):
```
Given your other comment, I think this needs to be `X.shape[0] * X.shape[1]`: for a sparse matrix, `X.size` is the number of stored values.

(Also, with `X.size` I run into a segfault with a big sparse matrix, 1000000 x 10000.)
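To make the distinction concrete (a small sketch, not the benchmark's matrices): for SciPy sparse matrices, `.size` counts stored values, while the logical element count is the product of the shape.

```python
import scipy.sparse as sp

# .size on a sparse matrix counts stored (nonzero) values only,
# whereas the dense reconstruction has shape[0] * shape[1] elements.
X = sp.random(1000, 1000, density=0.001, format="csr", random_state=0)
stored = X.size                      # == X.nnz, roughly 1000 here
logical = X.shape[0] * X.shape[1]    # 1_000_000
```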
OK, thanks. I reverted to `X.shape[0] * X.shape[1]`.
The main change is to revert the change to `sklearn.utils.extmath` from #23373. Closes #23418. cc @glemaitre.

Other changes:

- `n_iter=5` instead of `n_iter=6`, which was creating infinite values with `power_iteration_normalizer=None`.

With this I can run the benchmarks on my machine in ~15 minutes.