[MRG+2] Incremental PCA #3285
Conversation
>>> ipca = IncrementalPCA(n_components=2)
>>> ipca.fit(X)
IncrementalPCA(batch_size=10, copy=True, n_components=2)
>>> print(np.linalg.norm(ipca.components_, axis=0))  # doctest: +ELLIPSIS
Older versions of numpy don't understand the `axis` argument in `np.linalg.norm`. One of the travis builds is failing due to this.
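A portable way to get the per-column norms on numpy releases that lack the `axis` argument of `np.linalg.norm` (a sketch, not part of the diff; the array below is illustrative):

```python
import numpy as np

# Illustrative array standing in for ipca.components_.
components = np.array([[0.6, 0.8],
                       [0.8, -0.6]])

# Equivalent to np.linalg.norm(components, axis=0), but works on older numpy
# versions that do not accept the axis keyword.
col_norms = np.sqrt(np.sum(components ** 2, axis=0))
print(col_norms)
```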
This looks really nice!
These plots show IncrementalPCA in comparison to PCA and RandomizedPCA. Note that for the error vs. batch_size tests, RandomizedPCA was not shown because it is significantly worse in all tested cases. This makes sense - random projections should be worse than projections estimated from the data (in general). The PCA lines in error and time vs. batch_size are constant because PCA has no batch_size, but it makes a good reference to compare against. Note that the "jagged plot" has a y scale of 1E-8... pretty sure this is just small numerical differences - the average is consistent with PCA across batch size (and the change is +- 5e-8, which seems OK to me).
Additionally, it would be very nice to add whitening and explained variance to this estimator. I have a few leads on how to do this, but I cannot find any academic papers about incremental estimation of these parameters. If I can get explained variance, then whitening follows naturally. If anyone has any ideas I am all ears. An added advantage is that if these parameters can be estimated, then PCA and IncrementalPCA could be merged, since batch_size=n_samples processing is exactly the same as a "regular" PCA. This may or may not be a good option, but it is something to think about.
Explained variance is nothing other than rescaled squared singular values.
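As a quick illustration of that relation (a sketch, not code from the PR; the random data is only for demonstration), the explained variance and its ratio can be read off the singular values of the centered data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Xc = X - X.mean(axis=0)                      # PCA works on centered data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S ** 2 / X.shape[0]     # rescaled squared singular values
# The rescaling constant cancels out in the ratio.
explained_variance_ratio = explained_variance / explained_variance.sum()
```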
I have access to the first singular values separately; everything after that is kind of "joint", in that you only get a combined version. You could do 2 SVDs for every pass through partial_fit, but it seems pretty wasteful just for explained variance - and I feel there must be a way to compute it directly (or at least reverse it out somehow, since we know the first values and a combined version of all the ones after)! Maybe using the old S (the S from the next-to-last batch) in conjunction with the final S somehow? Another approach would be to take the ideas from this StackExchange link and try to use them somehow...
OK, so the S values you have access to do not correspond to the global S of the full data matrix. I do not know of a formula using old_S and S, but will keep the problem in mind.
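For reference, a minimal sketch of the merge step behind this "joint" S (names such as `S_old`/`V_old` are illustrative, and mean correction is ignored): the singular values of the scaled old right singular vectors stacked with the new batch match those of all the data seen so far, which is why only the combined S is available.

```python
import numpy as np

rng = np.random.RandomState(0)
X_old = rng.randn(40, 5)
X_new = rng.randn(20, 5)

# Factorization of the data seen so far (centering ignored for brevity).
U_old, S_old, V_old = np.linalg.svd(X_old, full_matrices=False)

# Merge step: stack the scaled right singular vectors with the new batch and
# re-decompose. S here is the "joint" quantity discussed above.
combined = np.vstack([S_old[:, np.newaxis] * V_old, X_new])
U, S, V = np.linalg.svd(combined, full_matrices=False)

# Same singular values as a direct SVD of all the data stacked together.
print(np.allclose(S, np.linalg.svd(np.vstack([X_old, X_new]), compute_uv=False)))
```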
Total variance is np.sum(X * X) (for centered X), so you could check something like (np.sum(eigvals) - np.sum(X * X)) / np.sum(X * X). Is this clear?
The total variance calculation is clear, but the eigvals part is not - the eigvals that fall out of the current calculation come from the per-batch SVD (the scaled previous components stacked with the new batch), not from a single SVD over all of the data.
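A small sanity check of the relation above (a sketch with random data, not from the PR): for a full SVD of centered data, the squared singular values sum to the total variance; with a truncated or incremental decomposition only part of it is covered.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
Xc = X - X.mean(axis=0)

eigvals = np.linalg.svd(Xc, compute_uv=False) ** 2

total_variance = np.sum(Xc * Xc)            # np.sum(X * X) on centered data
print(np.allclose(np.sum(eigvals), total_variance))   # True for a full SVD
```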
Initial investigations indicate that S ** 2 is still good for calculating explained_variance_ and explained_variance_ratio_. Working on a commit now with calculations and unit tests, and will post some plots when that is done. This should mean that whitening can be added as well. I guess now is a good time to start the discussion about inverse_transform whitening - I think it is preferable to invert the whitening as well, even though PCA doesn't currently do this. Because there is no prior usage of the IncrementalPCA module, we would be starting off "correctly" IMO without the baggage/possible deprecation that may have to happen in PCA. See the discussion at #3107 for more details.
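To make the whitening/unwhitening point concrete, here is a rough sketch (not the PR's implementation; `mean_`, `components_` and `explained_variance_` are assumed to be the fitted attributes) of a transform that whitens and an inverse that undoes the whitening as well:

```python
import numpy as np

def whiten_transform(X, mean_, components_, explained_variance_):
    # Project onto the components and scale each one to unit variance.
    return np.dot(X - mean_, components_.T) / np.sqrt(explained_variance_)

def inverse_transform_unwhitened(X_t, mean_, components_, explained_variance_):
    # Undo both the scaling and the projection, so a round trip through
    # whiten_transform followed by this function approximately recovers X
    # (exactly, when no components were discarded).
    return np.dot(X_t * np.sqrt(explained_variance_), components_) + mean_
```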
Nice! For incremental SVD I ...
Comparison of IncrementalPCA vs PCA explained_variance_ and explained_variance_ratio_. I am still skeptical of the explained_variance_ratio_ - not sure if I should track the overall sum as is currently happening, or only track the sum through [:n_components]... You lose information due to cutting components at each partial_fit, but just like the other plots, the closer your batch_size gets to n_samples and the more components are kept, the closer the estimate gets to the PCA result. If this seems reasonable, I will work on adding whitening (as well as inverse transform unwhitening) and some tests for these extras (probably just comparing to the PCA result to make sure they are close). Generating script:
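(The generating script itself is not reproduced here; a minimal sketch of that kind of comparison, with an illustrative dataset and batch sizes, might look like the following.)

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA

X = load_digits().data
n_components = 10

pca = PCA(n_components=n_components).fit(X)

for batch_size in (50, 100, 300, 900, X.shape[0]):
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size).fit(X)
    err = np.max(np.abs(pca.explained_variance_ratio_
                        - ipca.explained_variance_ratio_))
    print(batch_size, err)
```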
After weighting with the ratio of the size of the batch vs. the total number of samples seen so far, this estimate looks closer to what I would expect. However, after talking with @ogrisel I am going to try using an online variance estimate (such as discussed here) and see if that gets a more accurate estimate of the ratio - .05 out of a total of 1.0 is still 5% error...
After tracking the explained_variance_ sum incrementally (using a batch Youngs and Cramer method, found in this paper), things look much better. Now all that is left is to add whitening and noise variance estimates as well as tests. Special thanks to @eickenberg and @ogrisel for discussions about this.
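For reference, a sketch of this kind of batch mean/variance update (the pairwise combination formula; the variable names here are illustrative, not the PR's internals):

```python
import numpy as np

def combine_mean_var(n_a, mean_a, M2_a, X_b):
    """Merge running column statistics with a new batch.

    n_a, mean_a, M2_a: sample count, mean and sum of squared deviations so far.
    Returns the updated (n, mean, M2); the variance is M2 / n.
    """
    n_b = X_b.shape[0]
    mean_b = X_b.mean(axis=0)
    M2_b = ((X_b - mean_b) ** 2).sum(axis=0)

    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta ** 2 * n_a * n_b / n
    return n, mean, M2

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
n, mean, M2 = 0, np.zeros(3), np.zeros(3)
for batch in np.array_split(X, 10):
    n, mean, M2 = combine_mean_var(n, mean, M2, batch)
print(np.allclose(M2 / n, X.var(axis=0)))   # True
```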
I think this is ready for another review/look by other people.
@ogrisel and I talked some about factoring some common methods between PCA, RandomizedPCA, KernelPCA, and others (get_precision, get_covariance, transform, inverse_transform) into a BasePCA class, but for now I just re-used the calculations from PCA. In the long term it might be something to think about and discuss.
@arjoly, @GaelVaroquaux any final review? Maybe @larsmans would be interested in having a look at this one as well.
(force-pushed from d981b0e to 5b5dd79)
I updated the what's new - your words summed it up pretty well :)
replacement for principal component analysis (PCA) when the dataset to be
decomposed is too large to fit in memory. IPCA builds a low-rank approximation
for the input data using an amount of memory which is independent of the
input data size.
Independent of the number of samples, I suppose, not the number of features? (In document processing, the number of features tends to grow as O(√n) in the number of samples, with a big constant.)
Right. The number of features has to stay constant over all batches and the featurewise cost is pretty much SVD dependent... I am looking at a future PR to add the ability to use randomized SVD in the case where desired n_components <<< n_features, which is pretty much exactly the case for documents IIRC! Basically a solver = 'random' parameter that changes the central SVD. @ogrisel wanted this and it is a really good idea :)
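Not part of this PR, but for a sense of what such a solver could build on, scikit-learn already ships a truncated randomized SVD helper (the data and sizes below are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
X = rng.randn(200, 10000)            # n_components <<< n_features

# Truncated randomized SVD of the centered batch; much cheaper than a full
# SVD when only a handful of components are needed.
Xc = X - X.mean(axis=0)
U, S, Vt = randomized_svd(Xc, n_components=5, random_state=0)
```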
I'm late to the party and I don't know the algorithm, but this seems like a useful addition and (apart from my comments) a well-written, readable estimator class. It looks like one could even do a sparse-matrix PCA with this: densify in batches, then feed those to the estimator (don't worry Gaël, I'm not suggesting to add that to the class :).
I never really thought about densify-ing sparse data, but I like the idea... will have to play with that. Maybe adding the SVD solver to the parameters will open up some space to play with this? Assuming randomized SVD would be way better for sparse matrices.
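A rough sketch of the densify-in-batches idea (illustrative only; at the time of this PR `IncrementalPCA` expects dense input, so each sparse batch is converted before `partial_fit`):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X_sparse = sparse.random(1000, 50, density=0.1, format="csr", random_state=rng)

ipca = IncrementalPCA(n_components=5)
batch_size = 100
for start in range(0, X_sparse.shape[0], batch_size):
    dense_batch = X_sparse[start:start + batch_size].toarray()  # densify one batch
    ipca.partial_fit(dense_batch)
```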
(force-pushed from 5b5dd79 to fd0c768)
Randomized SVD is great for sparse data, but after mean-centering the sparsity is destroyed. A different solver could do the mean-centering on the fly instead of as a preprocessing step to avoid densifying. That could even help in the dense case, as there's no longer a need to first compute an explicitly centered copy of the data.
Indeed @larsmans, we could do real PCA with sparse data by centering only small mini-batches on the fly. I would vote to implement that in a separate PR though.
Oh, I'm just mentioning it, not requesting that it should be written now.
(force-pushed from fd0c768 to 5f8271f)
I agree on the new PR side, but I definitely think it is a good idea! Just updated the double backticks.
Well, in a later PR, I would see value in that. In this PR, I am afraid ...
Sure, sure. +1 for merge.
Thanks @kastnerkyle!
Thanks @kastnerkyle!!! Congratulations on your high quality code.
Thanks everyone for all the review and improvements :) glad to see this merged
🍻
cool! thanks for the great work!
🍻
[MRG+2] Incremental PCA
An implementation of Incremental PCA. See Incremental Learning for Visual Tracking, J. Lim, D. Ross, R. Lin, M. Yang, 2004 for more details.
The key idea behind this PR is to have an out-of-core PCA that gives roughly equivalent results while using constant memory (on the order of `batch_size` rather than `n_samples`).
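A minimal usage sketch of that idea (the chunked random data stands in for chunks streamed from disk):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
ipca = IncrementalPCA(n_components=2, batch_size=200)

# Only one chunk of roughly batch_size rows needs to be in memory at any time,
# rather than the full n_samples.
for _ in range(10):
    chunk = rng.randn(200, 50)        # stand-in for a chunk read from disk
    ipca.partial_fit(chunk)

print(ipca.transform(rng.randn(5, 50)).shape)   # (5, 2)
```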