
FIX Reduces memory usage of PCA.transform #22553


Merged
11 commits merged into scikit-learn:main from pca_mem on Mar 2, 2022

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Feb 20, 2022

Reference Issues/PRs

Fixes #11102

What does this implement/fix? Explain your changes.

This PR implements an iterative algorithm for calculating variance on sufficiently large data (`X.shape[0] > 1000`), thereby reducing the peak memory footprint regardless of the choice of the `copy` argument.

Any other comments?

  • Currently failing two tests due to relative error of approximately 1.4e-5
  • There is a small (0–4 s) slowdown with the iterative method on small/medium datasets (num_features < 6400) due to the added overhead; on large datasets it is very close in speed to the standard method.
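The chunked approach described above can be sketched as follows. The function name, chunk size, and accumulation scheme are illustrative assumptions, not the PR's actual code. Note that accumulating raw sums of squares like this is numerically unstable for data with large means, which is consistent with the ~1.4e-5 relative errors mentioned above; a Welford-style update would be more robust.

```python
import numpy as np

def chunked_variance(X, chunk_size=1000, ddof=1):
    """Feature-wise variance computed over row chunks to cap peak memory.

    Illustrative sketch only: accumulates per-feature sums and sums of
    squares one chunk at a time, so no temporary as large as X is ever
    allocated.
    """
    n_samples, n_features = X.shape
    sums = np.zeros(n_features, dtype=np.float64)
    sq_sums = np.zeros(n_features, dtype=np.float64)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        sums += chunk.sum(axis=0)
        sq_sums += np.square(chunk).sum(axis=0)
    mean = sums / n_samples
    # Classic shortcut formula: Var = (sum(x^2) - n * mean^2) / (n - ddof)
    return (sq_sums - n_samples * mean**2) / (n_samples - ddof)
```

The peak extra memory is one chunk's worth of squared values plus a few per-feature vectors, instead of a full squared copy of `X`.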

@jeremiedbb
Member

I'm not convinced by that kind of optimisation. It adds a lot of complexity in terms of maintenance and readability to work around a flaw in a single numpy function. I think it would be much better if the memory issue were fixed directly in numpy. There are discussions and even a PR about that: numpy/numpy#13199, numpy/numpy#13263. The PR seems a bit stalled, but maybe we can ask whether they are still willing to finish it.

- Created a new method in `extmath.py` for calculating feature-wise
in-place variance
- PCA now uses in-place feature variance calculation to avoid mem spike
- Added new test to check for correctness of internal `total_var`
calculation
- Fixed existing whitening test by changing `_X` -> `_X.copy()` to account
for the `copy=False` semantics of accepting in-place changes to `X`
@Micky774
Contributor Author

@jeremiedbb Looks like it's not happening, at least on the numpy side of things :(

I agree regarding the induced complexity of the iterative calculation; in fact, it also turned out to be quite inaccurate, so I'm no longer considering that option.

Instead, I have added my `in_place_feature_var` method to `extmath.py` and included a new test for correctness. When the user passes `copy=False`, they agree to allow the input `X` to be mutated, so the in-place calculation is acceptable; with `copy=True`, the computation operates on a private copy, so it is safe in both cases. It is less than ideal, but I think it's the best option we have right now. It currently passes the existing tests* and matches `np.var(X, ddof=1, axis=0)`.

The in-place method results in essentially no new allocated memory, and runs faster than the old method.
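A minimal sketch of the in-place idea follows. The helper name and signature here are illustrative assumptions, not the PR's actual `in_place_feature_var`, but the mechanism is the same: reuse `X`'s own buffer for centering and squaring so no array of `X`'s size is ever allocated.

```python
import numpy as np

def inplace_feature_variance(X, ddof=1):
    """Feature-wise variance with no X-sized temporary allocations.

    WARNING: destroys the contents of X. Illustrative sketch only,
    not the PR's actual implementation.
    """
    n_samples = X.shape[0]
    X -= X.mean(axis=0)    # center in place
    np.square(X, out=X)    # square in place, reusing X's buffer
    return X.sum(axis=0) / (n_samples - ddof)
```

Because `copy=False` already licenses mutation of `X`, destroying its contents here is acceptable; the only new allocations are the per-feature mean and the result vector.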

* I did have to modify `test_whitening` in `test_pca.py`, but I believe it is a proper correction: the test omitted a copy call on one of the inputs to `fit`.

cc: @thomasjpfan

Member

@jeremiedbb jeremiedbb left a comment


@Micky774 I find this solution much cleaner!

I still think it's numpy's role to implement a low-memory `var`, so maybe add a comment saying that this is a workaround because numpy doesn't offer that yet.

@Micky774
Contributor Author

Micky774 commented Feb 27, 2022

@jeremiedbb I've moved the remainder of the in-place variance calculation into _pca.py and removed the now unnecessary function from extmath.py. I also included the workaround comment in the code so hopefully there'll be a more appropriate replacement later :)

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @Micky774 !

@jeremiedbb jeremiedbb added this to the 1.1 milestone Mar 2, 2022
@thomasjpfan thomasjpfan changed the title Reduces memory usage of PCA.transform w/ solver=randomized FIX Reduces memory usage of PCA.transform w/ solver=randomized Mar 2, 2022
Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan changed the title FIX Reduces memory usage of PCA.transform w/ solver=randomized FIX Reduces memory usage of PCA.transform Mar 2, 2022
@thomasjpfan thomasjpfan merged commit 2e213c6 into scikit-learn:main Mar 2, 2022
@Micky774 Micky774 deleted the pca_mem branch March 4, 2022 03:50

Successfully merging this pull request may close these issues.

Randomized PCA.transform uses a lot of RAM
3 participants