FIX Reduces memory usage of PCA.transform #22553
Conversation
I'm not convinced about that kind of optimisation. It adds a lot of complexity in terms of maintenance and readability to get around a flaw in a single numpy function. I think it would be much better if the memory issue were fixed directly in numpy. There are discussions and even a PR about that: numpy/numpy#13199, numpy/numpy#13263. The PR seems a bit stalled, but maybe we can see if they are still willing to finish it.
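For context, the spike being discussed comes from `np.var` materializing a full centered copy of its input internally. A rough illustration (the array shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 512))  # ~400 MB of float64

# Internally, np.var computes mean((X - X.mean(axis=0)) ** 2), which
# materializes a centered temporary the same size as X, roughly doubling
# peak memory -- the behavior tracked in numpy/numpy#13199.
total_var = np.var(X, ddof=1, axis=0).sum()
```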
- Created a new method in `extmath.py` for calculating feature-wise in-place variance (see the sketch below)
- PCA now uses the in-place feature variance calculation to avoid the memory spike
- Added a new test to check the correctness of the internal `total_var` calculation
- Fixed the existing whitening test by changing `_X` -> `_X.copy()` to account for the `copy=False` semantics of accepting in-place changes to `X`
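A minimal sketch of the in-place idea, with a hypothetical helper name (this is not the exact code added to `extmath.py`):

```python
import numpy as np

def _inplace_column_variance(X, ddof=1):
    """Feature-wise variance that reuses X's buffer instead of allocating
    a centered temporary the size of X.

    Destructive: X (assumed to be a float array) is overwritten, matching
    the copy=False semantics mentioned above. Illustrative sketch only.
    """
    n_samples = X.shape[0]
    X -= X.mean(axis=0)  # center in place: no (n_samples, n_features) temporary
    X **= 2              # square in place
    return X.sum(axis=0) / (n_samples - ddof)
```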
@jeremiedbb Looks like it's not happening, at least on the numpy side of things :( I agree regarding the induced complexity of the iterative calculation -- actually it looks like it was also very inaccurate, so I'm no longer considering that option. Instead, I have added the in-place variance calculation summarized above. The in-place method results in essentially no new allocated memory, and runs faster than the old method.

* I did have to modify the existing whitening test to account for the in-place changes to `X`.

cc: @thomasjpfan
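The correctness check mentioned in the summary might be shaped roughly like this, reusing the hypothetical `_inplace_column_variance` sketch above (the actual test in the PR checks PCA's internal `total_var`):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((1_000, 50))

# The in-place result should agree with numpy's reference implementation;
# pass a copy, since the in-place version overwrites its input.
expected = np.var(X, ddof=1, axis=0)
actual = _inplace_column_variance(X.copy())
np.testing.assert_allclose(actual, expected)
```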
@Micky774 I find this solution much cleaner!
I still think it's numpy's role to implement a low-memory var, so maybe add a comment just saying that it's a workaround because numpy doesn't have that yet.
@jeremiedbb I've moved the remainder of the in-place variance calculation into the new helper in `extmath.py`.
LGTM. Thanks @Micky774!
[Benchmark plots: PCA.transform w/ solver=randomized]
LGTM
[Benchmark plots: PCA.transform w/ solver=randomized]
Reference Issues/PRs
Fixes #11102
What does this implement/fix? Explain your changes.
This PR implements an iterative algorithm for calculating variance for sufficiently large data (`X.shape[0] > 1000`), thereby reducing the peak memory footprint regardless of the choice of the `copy` argument.

Any other comments?
The new method is marginally slower (a difference on the order of `1.4e-5` seconds for `num_features < 6400`) due to the induced overhead; however, it is incredibly close in speed to the normal method on large datasets.
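For reference, an iterative variance like the one originally proposed could look something like the following two-pass, chunked sketch (the chunk size and all names are illustrative, and this approach was ultimately dropped in favor of the in-place calculation):

```python
import numpy as np

def _chunked_column_variance(X, chunk_size=1000, ddof=1):
    """Two-pass variance over row chunks: only one chunk-sized centered
    temporary is alive at a time, bounding peak memory by the chunk size
    rather than the full array. Illustrative sketch only.
    """
    n_samples, n_features = X.shape
    mean = X.mean(axis=0)
    ssq = np.zeros(n_features)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size] - mean   # small temporary
        ssq += np.einsum("ij,ij->j", chunk, chunk)   # per-column sum of squares
    return ssq / (n_samples - ddof)
```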