
ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data #29512


Open · wants to merge 13 commits into main

Conversation

@Charlie-XIAO Charlie-XIAO commented Jul 17, 2024

Fixes #28386, motivated by #18689.

  • Use the _implicit_column_offset operator as implemented in #18689 (see the sketch after this list).
  • Add an svd_solver parameter supporting "full" (default, the original behavior) and "arpack" (truncated SVD).
  • Implement an _implicit_vstack operator to avoid densifying data in intermediate steps.
  • Add tests for _implicit_vstack.
  • Add tests for IncrementalPCA with svd_solver="arpack".
  • Test the performance improvement on the fetch_20newsgroups_vectorized dataset and update the changelog.
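
For intuition, here is a minimal sketch of the implicit-centering idea (illustrative only; the actual _implicit_column_offset helper from #18689 differs in details, and all names below are my own): the centered matrix X - mu is wrapped in a scipy LinearOperator, so ARPACK's svds can work on sparse X without it ever being densified.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds


def implicit_column_offset(X, mu):
    # Behaves like X - mu (mu broadcast over rows) without materializing it.
    mu = np.asarray(mu).reshape(1, -1)
    return LinearOperator(
        shape=X.shape,
        dtype=X.dtype,
        matvec=lambda v: X @ v.ravel() - mu @ v.ravel(),
        matmat=lambda M: X @ M - mu @ M,
        rmatvec=lambda v: X.T @ v.ravel() - mu.ravel() * v.sum(),
        rmatmat=lambda M: X.T @ M - mu.reshape(-1, 1) * M.sum(axis=0),
    )


X = sp.random(200, 50, density=0.05, format="csr", random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()

# Truncated SVD of the centered matrix; X stays sparse throughout.
U, S, Vt = svds(implicit_column_offset(X, mu), k=10)

# Sanity check against explicit dense centering.
S_dense = np.linalg.svd(X.toarray() - mu, compute_uv=False)
assert np.allclose(np.sort(S), np.sort(S_dense[:10]))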

Enhancement Overview

The following code uses the first 500 entries of the 20 newsgroups training set, of shape (500, 130107). When both routines use truncated SVD via ARPACK, the sparse routine is ~3x faster than the dense one and uses over 30x less memory. The dense ARPACK routine is in turn ~10x faster than the dense routine with full SVD (the original setup).

Example code

import time

import numpy as np
from memory_profiler import profile

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import IncrementalPCA


@profile
def sparse_ipca_arpack(X):
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X)
    return ipca, coords


@profile
def dense_ipca_arpack(X):
    X = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X)
    return ipca, coords


@profile
def dense_ipca_full(X):
    X = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="full")
    coords = ipca.fit_transform(X)
    return ipca, coords


if __name__ == "__main__":
    X, _ = fetch_20newsgroups_vectorized(return_X_y=True)
    X = X[:500]

    start = time.perf_counter()
    saipca, sacoords = sparse_ipca_arpack(X)
    print(f"Sparse IPCA with ARPACK: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    daipca, dacoords = dense_ipca_arpack(X)
    print(f"Dense IPCA with ARPACK: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    dfipca, dfcoords = dense_ipca_full(X)
    print(f"Dense IPCA full (original): {time.perf_counter() - start:.2f}s")

    assert np.allclose(saipca.components_, daipca.components_)
    assert np.allclose(saipca.explained_variance_, daipca.explained_variance_)
    assert np.allclose(saipca.singular_values_, daipca.singular_values_)
    assert np.allclose(saipca.transform(X), daipca.transform(X))
    assert np.allclose(saipca.transform(X.toarray()), daipca.transform(X.toarray()))

    assert np.allclose(saipca.components_, dfipca.components_)
    assert np.allclose(saipca.explained_variance_, dfipca.explained_variance_)
    assert np.allclose(saipca.singular_values_, dfipca.singular_values_)
    assert np.allclose(saipca.transform(X), dfipca.transform(X))
    assert np.allclose(saipca.transform(X.toarray()), dfipca.transform(X.toarray()))

Sparse IPCA with ARPACK: 1.21s

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    10    111.9 MiB    111.9 MiB           1   @profile
    11                                         def sparse_ipca_arpack(X):
    12    111.9 MiB      0.0 MiB           1       ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    13    139.0 MiB     27.1 MiB           1       coords = ipca.fit_transform(X)
    14    139.0 MiB      0.0 MiB           1       return ipca, coords

Dense IPCA with ARPACK: 4.16s

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    17    139.0 MiB    139.0 MiB           1   @profile
    18                                         def dense_ipca_arpack(X):
    19    307.0 MiB    168.0 MiB           1       X = X.toarray()
    20    307.0 MiB      0.0 MiB           1       ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    21   1281.0 MiB    974.0 MiB           1       coords = ipca.fit_transform(X)
    22   1281.0 MiB      0.0 MiB           1       return ipca, coords

Dense IPCA full (original): 39.42s

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    25    784.7 MiB    784.7 MiB           1   @profile
    26                                         def dense_ipca_full(X):
    27    952.7 MiB    168.0 MiB           1       X = X.toarray()
    28    952.7 MiB      0.0 MiB           1       ipca = IncrementalPCA(n_components=20, svd_solver="full")
    29   1907.0 MiB    954.3 MiB           1       coords = ipca.fit_transform(X)
    30   1907.0 MiB      0.0 MiB           1       return ipca, coords

Additional Comments & Questions

About the new svd_solver parameter: This is added because I found no other way to support sparse input without densifying, and I think it is reasonable to add. "full" (the default) is the original behavior, where sparse data is densified in batches. "arpack" is the truncated SVD version that does not densify sparse data. I did not add an "auto" option because it should ideally select "arpack" for sparse data, which differs from the default behavior. Perhaps we can still have an "auto" option, not as the default, and make it the default some day? A possible policy is sketched below.
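
A minimal sketch of what such a hypothetical "auto" policy could look like (the helper name and placement are purely illustrative, not part of this PR):

from scipy.sparse import issparse


def _resolve_svd_solver(X, svd_solver):
    # Hypothetical policy: prefer ARPACK for sparse input (avoids densifying);
    # keep the original full SVD for dense input.
    if svd_solver == "auto":
        return "arpack" if issparse(X) else "full"
    return svd_solver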

About sparse support: Previously the fit method accepted CSR, CSC, and LIL formats. This PR no longer supports the LIL format, since the sparse version of _incremental_mean_and_var only supports CSR and CSC. We could indeed convert LIL to CSR/CSC to keep supporting that format, but is this necessary? Maybe we can just add a note somewhere in the changelog, since the conversion is very easy for users to do themselves (see the snippet below).
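
For reference, the conversion users would need is a one-liner (sketch; X_lil stands for any LIL-format input):

import numpy as np
import scipy.sparse as sp

X_lil = sp.lil_matrix(np.eye(5))  # any LIL-format input
X_csr = X_lil.tocsr()             # or X_lil.tocsc(); do this before calling fit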

About testing: I currently extended most tests to cover both svd_solvers on dense data; do I need to extend them to dense and sparse containers as well? Currently the only test that uses sparse data with the ARPACK solver is test_incremental_pca_sparse, which performs some basic validation as before. Is this enough? A possible broader parametrization is sketched below.
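
If broader coverage is wanted, a hypothetical shape for it could be (test name and body are illustrative, assuming this PR's svd_solver parameter):

import numpy as np
import pytest
from scipy.sparse import csc_matrix, csr_matrix

from sklearn.decomposition import IncrementalPCA


@pytest.mark.parametrize("svd_solver", ["full", "arpack"])
@pytest.mark.parametrize("container", [np.asarray, csr_matrix, csc_matrix])
def test_incremental_pca_containers(svd_solver, container):
    rng = np.random.RandomState(0)
    X = container(rng.rand(60, 10))
    ipca = IncrementalPCA(n_components=3, batch_size=20, svd_solver=svd_solver)
    Xt = ipca.fit_transform(X)
    assert Xt.shape == (60, 3)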


github-actions bot commented Jul 17, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e3f2122.

@@ -86,7 +86,7 @@ def get_precision(self):
         exp_var_diff = xp.where(
             exp_var > self.noise_variance_,
             exp_var_diff,
-            xp.asarray(0.0, device=device(exp_var)),
+            xp.asarray(1e-10, device=device(exp_var)),
@Charlie-XIAO (Contributor, Author) commented on this diff:
0.0 becomes nan when 1.0 / exp_var_diff is computed later, and linalg.inv then fails on the result. I wonder if giving it a small value instead is reasonable; otherwise, perhaps exp_var should theoretically always be greater than self.noise_variance_, which in turn would mean that my implementation is incorrect somewhere?

Note: test_incremental_pca_sparse when n_components = X.shape[1] - 1 triggers the issue.
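
A minimal standalone reproduction of this kind of failure (plain numpy/scipy, not the sklearn code path; the 0.0 is the value replaced in the diff above):

import numpy as np
from scipy import linalg

exp_var_diff = np.array([1.0, 0.0])  # one component clamped to 0.0
with np.errstate(divide="ignore"):
    scaled = 1.0 / exp_var_diff      # -> [1., inf]

try:
    linalg.inv(np.diag(scaled))
except ValueError as exc:
    print(exc)                       # scipy rejects non-finite input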

@Charlie-XIAO Charlie-XIAO changed the title from "ENH support partial fitting incremental PCA on sparse data" to "ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data" Jul 19, 2024
@Charlie-XIAO Charlie-XIAO marked this pull request as ready for review July 19, 2024 13:04
Successfully merging this pull request may close these issues.

Proper sparse support in IncrementalPCA