
FIX Reduces memory usage of PCA.transform #22553


Merged
11 commits merged into scikit-learn:main from pca_mem on Mar 2, 2022

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Feb 20, 2022

Reference Issues/PRs

Fixes #11102

What does this implement/fix? Explain your changes.

This PR implements an iterative algorithm for calculating variance on sufficiently large data (`X.shape[0] > 1000`), thereby reducing the peak memory footprint regardless of the choice of the `copy` argument.

Any other comments?

  • Currently failing two tests due to relative error of approximately 1.4e-5
  • There is a small (0–4 s) slowdown with the iterative method on small/medium datasets (num_features < 6400) due to the added overhead; on large datasets it is very close in speed to the standard method.
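The chunked approach described above can be sketched as follows. The function name, chunk size, and accumulation scheme are illustrative assumptions, not the PR's actual code. Note that accumulating raw sums of squares like this is numerically unstable for data with large means, which is consistent with the ~1.4e-5 relative errors mentioned above; a Welford-style update would be more robust.

```python
import numpy as np

def chunked_variance(X, chunk_size=1000, ddof=1):
    """Feature-wise variance computed over row chunks to cap peak memory.

    Illustrative sketch only: accumulates per-feature sums and sums of
    squares one chunk at a time, so no temporary as large as X is ever
    allocated.
    """
    n_samples, n_features = X.shape
    sums = np.zeros(n_features, dtype=np.float64)
    sq_sums = np.zeros(n_features, dtype=np.float64)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        sums += chunk.sum(axis=0)
        sq_sums += np.square(chunk).sum(axis=0)
    mean = sums / n_samples
    # Classic shortcut formula: Var = (sum(x^2) - n * mean^2) / (n - ddof)
    return (sq_sums - n_samples * mean**2) / (n_samples - ddof)
```

The peak extra memory is one chunk's worth of squared values plus a few per-feature vectors, instead of a full squared copy of `X`.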

@jeremiedbb
Member

I'm not convinced by that kind of optimisation. It adds a lot of complexity in terms of maintenance and readability to work around a flaw in a single numpy function. I think it would be much better if the memory issue were fixed directly in numpy. There are discussions and even a PR about that: numpy/numpy#13199, numpy/numpy#13263. The PR seems a bit stalled, but maybe we can ask whether they are still willing to finish it.

- Created a new method in `extmath.py` for calculating feature-wise
in-place variance
- PCA now uses in-place feature variance calculation to avoid mem spike
- Added new test to check for correctness of internal `total_var`
calculation
- Fixed existing whitening test by changing `_X` -> `_X.copy()` to account
for the `copy=False` semantics of accepting in-place changes to `X`
@Micky774
Contributor Author

@jeremiedbb Looks like it's not happening, at least on the numpy side of things :(

I agree regarding the induced complexity of the iterative calculation; in fact, it also turned out to be quite inaccurate, so I'm no longer considering that option.

Instead, I have added my `in_place_feature_var` method to `extmath.py` and included a new test for correctness. When the user passes `copy=False`, they agree to allow the input `X` to be mutated, so the in-place calculation is acceptable; with `copy=True`, the computation operates on a private copy, so it is safe in both cases. It is less than ideal, but I think it's the best option we have right now. It currently passes the existing tests* and matches `np.var(X, ddof=1, axis=0)`.

The in-place method results in essentially no new allocated memory, and runs faster than the old method.
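A minimal sketch of the in-place idea follows. The helper name and signature here are illustrative assumptions, not the PR's actual `in_place_feature_var`, but the mechanism is the same: reuse `X`'s own buffer for centering and squaring so no array of `X`'s size is ever allocated.

```python
import numpy as np

def inplace_feature_variance(X, ddof=1):
    """Feature-wise variance with no X-sized temporary allocations.

    WARNING: destroys the contents of X. Illustrative sketch only,
    not the PR's actual implementation.
    """
    n_samples = X.shape[0]
    X -= X.mean(axis=0)    # center in place
    np.square(X, out=X)    # square in place, reusing X's buffer
    return X.sum(axis=0) / (n_samples - ddof)
```

Because `copy=False` already licenses mutation of `X`, destroying its contents here is acceptable; the only new allocations are the per-feature mean and the result vector.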

* I did have to modify `test_whitening` in `test_pca.py`, but I believe it is a proper correction: the test omitted a copy call on one of the inputs to `fit`.

cc: @thomasjpfan

Member

@jeremiedbb jeremiedbb left a comment


@Micky774 I find this solution much cleaner!

I still think it's numpy's role to implement a low-memory `var`, so maybe add a comment saying that this is a workaround because numpy doesn't offer that yet.

@Micky774
Contributor Author

Micky774 commented Feb 27, 2022

@jeremiedbb I've moved the remainder of the in-place variance calculation into _pca.py and removed the now unnecessary function from extmath.py. I also included the workaround comment in the code so hopefully there'll be a more appropriate replacement later :)

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @Micky774 !

@jeremiedbb jeremiedbb added this to the 1.1 milestone Mar 2, 2022
@thomasjpfan thomasjpfan changed the title Reduces memory usage of PCA.transform w/ solver=randomized FIX Reduces memory usage of PCA.transform w/ solver=randomized Mar 2, 2022
Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan changed the title FIX Reduces memory usage of PCA.transform w/ solver=randomized FIX Reduces memory usage of PCA.transform Mar 2, 2022
@thomasjpfan thomasjpfan merged commit 2e213c6 into scikit-learn:main Mar 2, 2022
@Micky774 Micky774 deleted the pca_mem branch March 4, 2022 03:50

Successfully merging this pull request may close these issues.

Randomized PCA.transform uses a lot of RAM
3 participants