ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data #29512
base: main
Conversation
```diff
@@ -86,7 +86,7 @@ def get_precision(self):
         exp_var_diff = xp.where(
             exp_var > self.noise_variance_,
             exp_var_diff,
-            xp.asarray(0.0, device=device(exp_var)),
+            xp.asarray(1e-10, device=device(exp_var)),
```
`0.0` becomes `nan` when doing `1.0 / exp_var_diff` later, and the result cannot be passed to `linalg.inv`. I wonder if giving it a small value instead is reasonable; otherwise, perhaps `exp_var` should theoretically always be greater than `self.noise_variance_`, which in turn would mean that my implementation is incorrect somewhere?

Note: `test_incremental_pca_sparse` triggers the issue when `n_components = X.shape[1] - 1`.
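A minimal NumPy illustration of the failure mode described above (variable names mirror the snippet but this is not the PR's actual code):

```python
import numpy as np

# Mimic the guarded difference: entries where exp_var <= noise become 0.0
exp_var = np.array([2.0, 1.0])
noise_variance = 1.0
exp_var_diff = np.where(exp_var > noise_variance, exp_var - noise_variance, 0.0)

with np.errstate(divide="ignore"):
    inv_diff = 1.0 / exp_var_diff  # the 0.0 entry becomes inf

print(inv_diff)  # [ 1. inf]
# Downstream arithmetic with inf then produces nan (e.g. inf - inf),
# and a matrix containing nan cannot be meaningfully inverted.
```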
Fixes #28386, motivated by #18689.

- Add the `_implicit_column_offset` operator as implemented in #18689.
- Add a `svd_solver` parameter supporting `"full"` (default, original behavior) and `"arpack"` (truncated SVD).
- Add a `_implicit_vstack` operator to avoid densifying data in intermediate steps.
- Add tests for `_implicit_vstack`.
- Support sparse input in `IncrementalPCA` with `svd_solver="arpack"`.
- Benchmark on the `fetch_20newsgroups_vectorized` dataset and update the changelog.

Enhancement Overview
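The implicit centering idea behind `_implicit_column_offset` can be sketched as a SciPy `LinearOperator` that represents the column-centered matrix without densifying it. This is a hedged illustration, not the PR's actual helper; the function name and signature here are hypothetical:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def implicit_column_offset(X, offset):
    # Hypothetical sketch: represent (X - 1 * offset^T) lazily by
    # correcting the results of matvec/rmatvec on the sparse X.
    offset = np.asarray(offset).ravel()  # 1-D per-column means
    return LinearOperator(
        shape=X.shape,
        matvec=lambda v: X @ np.ravel(v) - offset @ np.ravel(v),
        rmatvec=lambda v: X.T @ np.ravel(v) - offset * np.sum(v),
        dtype=X.dtype,
    )

rng = np.random.RandomState(0)
X = sp.random(50, 20, density=0.3, format="csr", random_state=rng)
mean = np.asarray(X.mean(axis=0)).ravel()

# Truncated SVD of the centered matrix, X itself never densified
U, S, Vt = svds(implicit_column_offset(X, mean), k=3)
```

The singular values agree with those of the densely centered matrix, which is what lets ARPACK work on centered sparse data.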
The following code uses the first 500 entries from the 20 newsgroups training set, of shape `(500, 130107)`. When both routines use truncated SVD via ARPACK, the sparse routine is ~3x faster and uses >30x less memory than the dense routine. Compared with the dense routine using full SVD (the original setup), it is ~10x faster.

Example code
Additional Comments & Questions
About the new `svd_solver` parameter: This is added because I found no other way to support sparse input without densifying, and I think it is reasonable to add. `"full"` (default) is the original behavior, where sparse data is densified in batches. `"arpack"` is the truncated SVD version that does not densify sparse data. I did not add an `"auto"` option because ideally it should select `"arpack"` for sparse data, which differs from the default behavior. Perhaps we can still have an `"auto"` option, just not as the default, and make it the default some day?

About sparse support: Previously the `fit` method accepted CSR, CSC, and LIL formats. This PR no longer supports the LIL format, because the sparse version of `_incremental_mean_and_var` only supports CSR and CSC. We could convert LIL to CSR/CSC to keep supporting that format, but is this necessary? Maybe we can just add a note in the changelog, since it is very easy for users to do the conversion themselves.

About testing: I extended most tests to cover both `svd_solver`s on dense data; do I need to extend them to sparse containers as well? Currently the only test that uses sparse data with the ARPACK solver is `test_incremental_pca_sparse`, which performs the same basic validation as before. Is this enough?
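Regarding the dropped LIL format mentioned above: the conversion users would need to do themselves is a one-liner (illustrative snippet, not code from the PR):

```python
import scipy.sparse as sp

X_lil = sp.lil_matrix((3, 4))
X_lil[0, 1] = 2.0
X_lil[2, 3] = 5.0

# Convert to a format the sparse code path accepts (CSR or CSC)
X_csr = X_lil.tocsr()
print(X_csr.format)  # csr
```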