
[MRG+2] Incremental PCA #3285


Merged 2 commits on Sep 23, 2014

Conversation

kastnerkyle
Member

An implementation of Incremental PCA. See "Incremental Learning for Visual Tracking" (J. Lim, D. Ross, R. Lin, M. Yang, 2004) for more details.

The key idea behind this PR is to have an out-of-core PCA that gives roughly equivalent results while using constant memory (on the order of the batch_size rather than n_samples).
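The constant-memory pattern described above can be sketched as follows; the data, sizes, and chunking loop here are illustrative, not taken from the PR itself:

```python
# Sketch: fit in fixed-size batches so memory stays on the order of the
# batch size rather than n_samples.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)  # stand-in for data that would be streamed from disk

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):  # each call sees only ~100 rows
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (1000, 5)
```

In a real out-of-core setting the batches would come from a generator reading chunks off disk rather than from an in-memory array.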

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9f3d149 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

>>> ipca = IncrementalPCA(n_components=2)
>>> ipca.fit(X)
IncrementalPCA(batch_size=10, copy=True, n_components=2)
>>> print(np.linalg.norm(ipca.components_, axis=0)) # doctest: +ELLIPSIS
Contributor


Older versions of numpy don't understand the axis argument in np.linalg.norm; one of the Travis builds is failing because of this.
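A version-independent workaround is to compute the column norms by hand (a minimal sketch; `column_norms` is a hypothetical helper, not part of this PR):

```python
import numpy as np

def column_norms(a):
    # Equivalent to np.linalg.norm(a, axis=0), which requires numpy >= 1.8
    return np.sqrt(np.sum(a ** 2, axis=0))

a = np.array([[3.0, 0.0],
              [4.0, 1.0]])
print(column_norms(a))  # [5. 1.]
```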

@eickenberg
Contributor

This looks really nice!

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 5c5197f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling adc28e7 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plots: error vs. batch_size at n_components = 45, 180, 315, 450; error vs. n_components at batch_size = 1000; runtime vs. batch_size at n_components = 45, 180, 315, 450; runtime vs. n_components at batch_size = 1000]

These plots show IncrementalPCA in comparison to PCA and RandomizedPCA. Note that for the error vs. batch_size tests, RandomizedPCA is not shown because it is significantly worse in all tested cases. This makes sense: random projections should, in general, be worse than projections estimated from the data. The PCA lines in the error and time vs. batch_size plots are constant because PCA has no batch_size, but they make a good reference to compare against.

Note that the "jagged plot" has a y-axis scale of 1e-8; I am pretty sure this is just small numerical differences. The average is consistent with PCA across batch sizes, and the variation is ±5e-8, which seems OK to me.

@kastnerkyle kastnerkyle changed the title [WIP] Incremental PCA [MRG] Incremental PCA Jun 26, 2014
@kastnerkyle
Member Author

Additionally, it would be very nice to add whitening and explained variance to this estimator. I have a few leads on how to do this, but I cannot find any academic papers about incremental estimation of these parameters. If I can get explained variance, then whitening follows naturally. If anyone has any ideas I am all ears.

An added advantage is that if these parameters can be estimated, then PCA and IncrementalPCA could be merged since batch_size=n_samples processing is exactly the same as a "regular" PCA. This may or may not be a good option, but it is something to think about.

@eickenberg
Contributor

Explained variance is nothing other than rescaled squared singular values.
You have access to those, right?
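A minimal sketch of that relation on centered data, assuming the S ** 2 / n_samples scaling convention:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
Xc = X - X.mean(axis=0)  # PCA operates on centered data

U, S, V = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S ** 2 / X.shape[0]
explained_variance_ratio = explained_variance / explained_variance.sum()

# The rescaled squared singular values account for all of the variance
print(np.allclose(explained_variance.sum(), Xc.var(axis=0).sum()))  # True
```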


@kastnerkyle
Member Author

I have access to the first ones separately; everything after that is kind of "joint", in that you only get U, S, V = svd(vstack((old_components, X, mean_adjust))) instead of just U, S, V = svd(X) as in the first pass through.

You could do 2 SVDs for every pass through partial_fit, but that seems pretty wasteful just for explained variance, and I feel there must be a way to compute it directly (or at least reverse it out somehow, since we know the first values and a combined version of all the ones after)!

Maybe using the old S (S for the next to last batch) in conjunction with the final S somehow?

Another approach would be to take the ideas from this StackExchange link and try to use them somehow...
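The joint-SVD update described above can be sketched roughly as follows (after the Ross et al. formulation; the variable names and the exact form of the mean-correction row are an illustration, not the PR's actual code). With no truncation it reproduces the exact singular values of the full centered data:

```python
import numpy as np

def partial_update(components, singular_values, mean, n_seen, X_new):
    """One incremental step: SVD of the stacked old basis (scaled by its
    singular values), the centered new batch, and a mean-correction row."""
    k = components.shape[0]
    m = X_new.shape[0]
    batch_mean = X_new.mean(axis=0)
    # Row accounting for the shift between the old mean and the batch mean
    mean_adjust = np.sqrt(n_seen * m / (n_seen + m)) * (mean - batch_mean)
    stacked = np.vstack((singular_values[:, None] * components,
                         X_new - batch_mean,
                         mean_adjust))
    U, S, V = np.linalg.svd(stacked, full_matrices=False)
    new_mean = (n_seen * mean + X_new.sum(axis=0)) / (n_seen + m)
    return V[:k], S[:k], new_mean, n_seen + m

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X1, X2 = X[:50], X[50:]

# Initialize from the first batch, then update with the second
U, S, V = np.linalg.svd(X1 - X1.mean(axis=0), full_matrices=False)
comp, sv, mean, n = V, S, X1.mean(axis=0), 50
comp, sv, mean, n = partial_update(comp, sv, mean, n, X2)

# With k = n_features (no truncation) the update is exact
S_full = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.allclose(sv, S_full))  # True
```

When k < rank, the truncation at each step makes the result approximate, which is exactly the batch_size/n_components trade-off shown in the plots.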

@eickenberg
Contributor

OK, so the S you have access to does not correspond to the global S of the data seen until then. I would have hoped that by "chance" they do, and that the differences of the matrix you are working with wrt the full matrix are again "by chance" accounted for by a change in U (since V is definitely the same, right?).

I do not know of a formula using old_S and S, but will keep the problem in my head :)


@agramfort
Member

Total variance is np.sum(X * X), which you can calculate by accumulation. Explained variance will be

(np.sum(eigvals) - np.sum(X * X)) / np.sum(X * X)

or something like this.

Clear?
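The accumulation idea can be sketched as below (pre-centered data for simplicity; the batching is illustrative, and the final ratio formula follows the sum-of-squares convention rather than the exact expression above):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
Xc = X - X.mean(axis=0)

total_sq = 0.0
for batch in np.array_split(Xc, 6):
    total_sq += np.sum(batch * batch)  # running sum, never holds all of X

# The sum of squared singular values equals the accumulated total,
# so explained-variance ratios can be formed per component.
S = np.linalg.svd(Xc, compute_uv=False)
print(np.isclose(np.sum(S ** 2), total_sq))  # True
```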

@kastnerkyle
Member Author

The total variance calculation is clear, but the eigvals part is not: the eigvals that fall out of the current calculation come from svd(vstack((previous_components, X, mean_adjust))) rather than from just X. I am running a test now to see if the eigvals at the end of several partial_fit calls are approximately equivalent to the eigvals from PCA, but I don't know yet. If so, I think the rule from the PCA block would work directly: calculate (S ** 2) / n_samples_seen, or something similar. If not...

@kastnerkyle
Member Author

Initial investigations indicate that S ** 2 is still good for calculating explained_variance_ and explained_variance_ratio_ . Working on a commit now with calculations and unit tests, and will post some plots when that is done. This should mean that whitening can be added as well.

I guess now is a good time to start the discussion about inverse_transform whitening: I think it is preferable to invert the whitening as well, even though PCA doesn't currently do this. Because there is no prior usage of the IncrementalPCA module, we would be starting off "correctly" IMO, without the baggage/possible deprecation that may have to happen in PCA. See the discussion at #3107 for more details.

@eickenberg
Contributor

Nice! For incremental SVD I am almost sure I have a proof now for S always being the correct S, just like V is always the correct V (for a full decomposition); at the very least I may have found a way to show that they are close... For incremental PCA, one would additionally need to fold the mean-adjustment considerations into this, just like in the cited paper. It could be straightforward, but I haven't looked at it.

@kastnerkyle
Member Author

[Plots: explained_variance_ and explained_variance_ratio_ comparison vs. batch size]

Comparison of IncrementalPCA vs. PCA explained_variance_ and explained_variance_ratio_. I am still skeptical of the explained_variance_ratio_; I am not sure whether I should track the overall sum, as is currently happening, or only track the sum through [:n_components]...

You lose information due to cutting components at each partial_fit, but just like in the other plots, the closer your batch_size gets to n_samples and the more components are kept, the closer the estimate gets to the PCA result.

If this seems reasonable, I will work on adding whitening (as well as inverse transform unwhitening) and some tests for these extras (probably just comparing to the PCA result to make sure they are close).

Generating script:

import numpy as np
from sklearn.decomposition import IncrementalPCA, PCA
from sklearn.datasets import make_low_rank_matrix
import matplotlib.pyplot as plt


def run_experiment(X, n_components_tests, batch_size_tests, ratio=False):
    all_err = []
    for n in n_components_tests:
        print("n_components %i" % n)
        pca = PCA(n_components=n)
        pca.fit(X)
        err = []
        for i in batch_size_tests:
            ipca = IncrementalPCA(n_components=n, batch_size=i)
            ipca.fit(X)
            if not ratio:
                print("Explained variance error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ -
                                     pca.explained_variance_))
            else:
                print("Explained variance ratio error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ratio_ -
                                     pca.explained_variance_ratio_))
            err.append(mae)
        all_err.append(err)
    return np.array(all_err)

X = make_low_rank_matrix(n_samples=10000, random_state=1999)
n_components_tests = np.array([2, 5, 10, 25, 50, 75, 100])
batch_size_tests = np.array([1000, 2500, 5000, 7500, 10000])
all_var_err = run_experiment(X, n_components_tests, batch_size_tests)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_var_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.figure()
all_ratio_err = run_experiment(X, n_components_tests, batch_size_tests,
                               ratio=True)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_ratio_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance ratio")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.show()

@kastnerkyle kastnerkyle changed the title [MRG] Incremental PCA [WIP] Incremental PCA Jun 27, 2014
@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 65cd21f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plot: explained_variance_ratio_ error vs. batch size]

Updated the explained variance ratio calculation. It seems closer to correct (errors are close to 0 for n_components=n_features, and otherwise similar to the previous calculation), but the "bump" in the chart leads me to believe that some kind of compensation has to happen when batches have different sizes. Around 7500, where the bump is worst, there are 2 batches: one of 5000 and one of 2500. Ideally, this plot should converge toward 0 as batch size increases.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling b2af89f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plot: explained_variance_ratio_ error vs. batch size]

After weighting by the ratio of the batch size to the total number of samples seen so far, this estimate looks closer to what I would expect. However, after talking with @ogrisel I am going to try an online variance estimate (such as the one discussed here) and see if that gives a more accurate estimate of the ratio; 0.05 out of a total of 1.0 is still 5% error...

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 5d9b4c1 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plots: explained_variance_ and explained_variance_ratio_ error vs. batch size]

After tracking the explained_variance_ sum incrementally (using a batch Youngs and Cramer method, found in this paper), things look much better. Now all that is left is to add whitening and noise variance estimates, as well as tests. Special thanks to @eickenberg and @ogrisel for the discussions about this.
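A batch mean/variance merge in the spirit of the Youngs-and-Cramer / Chan et al. update can be sketched as below; the function name and signature are illustrative, not the PR's actual code:

```python
import numpy as np

def incremental_mean_var(mean, var, n, X_new):
    """Merge running columnwise (mean, var, count) with a new batch."""
    m = X_new.shape[0]
    batch_mean = X_new.mean(axis=0)
    batch_var = X_new.var(axis=0)
    total = n + m
    delta = batch_mean - mean
    merged_mean = mean + delta * m / total
    # Combine sums of squared deviations, plus a cross term for the mean shift
    ssd = var * n + batch_var * m + delta ** 2 * n * m / total
    return merged_mean, ssd / total, total

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
mean, var, n = X[:100].mean(axis=0), X[:100].var(axis=0), 100
for batch in np.array_split(X[100:], 4):
    mean, var, n = incremental_mean_var(mean, var, n, batch)

# Matches the single-pass statistics over all of X
print(np.allclose(mean, X.mean(axis=0)), np.allclose(var, X.var(axis=0)))
```

The cross term is what the naive per-batch weighting misses when batch sizes differ, which is consistent with the "bump" observed above.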

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling e0cb293 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 500a298 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle kastnerkyle changed the title [WIP] Incremental PCA [MRG] Incremental PCA Jul 2, 2014
@kastnerkyle
Member Author

I think this is ready for another review/look by other people

@kastnerkyle kastnerkyle changed the title [MRG] Incremental PCA [WIP] Incremental PCA Jul 4, 2014
@kastnerkyle
Member Author

@ogrisel and I talked some about factoring some common methods between PCA, RandomizedPCA, KernelPCA, and others (get_precision, get_covariance, transform, inverse_transform) into a BasePCA class, but for now I just re-used the calculations from PCA. In the long term it might be something to think about and discuss.

@ogrisel
Member

ogrisel commented Sep 23, 2014

@arjoly, @GaelVaroquaux any final review? Maybe @larsmans would be interested in having a look at this one as well.

@kastnerkyle
Member Author

I updated the what's new - your words summed it up pretty well :)

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 5b5dd79 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

replacement for principal component analysis (PCA) when the dataset to be
decomposed is too large to fit in memory. IPCA builds a low-rank approximation
for the input data using an amount of memory which is independent of the
input data size.
Member


Independent of the number of samples, I suppose, not the number of features? (In document processing, the number of features tends to grow as O(√n) in the number of samples, with a big constant.)

Member Author


Right. The number of features has to stay constant over all batches, and the featurewise cost is pretty much SVD dependent... I am looking at a future PR to add the ability to use randomized SVD in the case where the desired n_components <<< n_features, which is pretty much exactly the case for documents IIRC! Basically a solver='random' parameter that changes the central SVD. @ogrisel wanted this and it is a really good idea :)
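A rough sketch of what such a randomized path could use under the hood, via sklearn.utils.extmath.randomized_svd; the wiring of a solver parameter into the estimator is hypothetical:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
X = rng.randn(200, 1000)  # many features, few desired components
Xc = X - X.mean(axis=0)

# Only the top 10 components are computed, so the cost scales with
# n_components rather than n_features.
U, S, V = randomized_svd(Xc, n_components=10, random_state=0)
print(U.shape, S.shape, V.shape)  # (200, 10) (10,) (10, 1000)
```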

@larsmans
Member

I'm late to the party and I don't know the algorithm, but this seems like a useful addition and (apart from my comments) a well-written, readable estimator class.

It looks like one could even do a sparse-matrix PCA with this: densify in batches, then feed those to the estimator (don't worry Gaël, I'm not suggesting to add that to the class :).
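The densify-in-batches idea can be sketched as below: keep the matrix sparse and densify only one slice at a time before partial_fit. This is purely illustrative, not part of this PR:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA

X_sparse = sp.random(1000, 50, density=0.05, format="csr", random_state=0)

ipca = IncrementalPCA(n_components=5)
batch_size = 200
for start in range(0, X_sparse.shape[0], batch_size):
    batch = X_sparse[start:start + batch_size].toarray()  # densify one batch
    ipca.partial_fit(batch)

print(ipca.components_.shape)  # (5, 50)
```

Peak memory is then driven by the dense batch, not by a dense copy of the whole matrix.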

@kastnerkyle
Member Author

I never really thought about densifying sparse data, but I like the idea... I will have to play with that. Maybe adding the SVD solver to the parameters will open up some space to play with this? I assume randomized SVD would be much better for sparse matrices.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling fd0c768 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

@larsmans
Member

Randomized SVD is great for sparse data, but after mean-centering the sparsity is destroyed. A different solver could do the mean-centering on the fly instead of as a preprocessing step to avoid densifying. That could even help in the dense case, as there's no longer a need to first compute an X - mu matrix.
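On-the-fly centering can be sketched with a LinearOperator: since (X - 1·muᵀ)v = Xv - (mu·v)·1, an iterative solver never needs the dense centered matrix. This is a sketch of the idea, not part of this PR:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(300, 40, density=0.1, format="csr", random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()
ones = np.ones(X.shape[0])

# Implicitly centered operator: sparsity of X is never destroyed
centered = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X.dot(np.ravel(v)) - ones * np.dot(mu, np.ravel(v)),
    rmatvec=lambda v: X.T.dot(np.ravel(v)) - mu * np.dot(ones, np.ravel(v)),
    dtype=np.float64,
)
U, S, Vt = svds(centered, k=5)

# Matches the SVD of the explicitly centered dense matrix
S_dense = np.linalg.svd(X.toarray() - mu, compute_uv=False)
print(np.allclose(np.sort(S), np.sort(S_dense)[-5:]))  # True
```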

@ogrisel
Member

ogrisel commented Sep 23, 2014

Indeed @larsmans we could do real PCA with sparse data by centering only small mini-batches on the fly.

I would vote to implement that in a separate PR though.

@larsmans
Member

Oh, I'm just mentioning it, not requesting that it should be written now.

@kastnerkyle
Member Author

I agree on the new PR side, but I definitely think it is a good idea! Just updated the double backticks.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 5f8271f on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

@GaelVaroquaux
Member

It looks like one could even do a sparse-matrix PCA with this: densify
in batches, then feed those to the estimator (don't worry Gaël, I'm not
suggesting to add that to the class :).

Well, in a later PR, I would see value in that. In this PR, I am afraid
of delaying the merge of an already very useful feature.

@larsmans
Member

Sure, sure. +1 for merge.

@larsmans larsmans changed the title [MRG+1] Incremental PCA [MRG+2] Incremental PCA Sep 23, 2014
ogrisel added a commit that referenced this pull request Sep 23, 2014
@ogrisel ogrisel merged commit 7426e6a into scikit-learn:master Sep 23, 2014
@ogrisel
Member

ogrisel commented Sep 23, 2014

Thanks @kastnerkyle!

@arjoly
Member

arjoly commented Sep 23, 2014

Thanks @kastnerkyle !!! Congratulations on your high-quality code.

@kastnerkyle
Member Author

Thanks everyone for all the review and improvements :) Glad to see this merged!

@agramfort
Member

🍻

@eickenberg
Contributor

cool! thanks for the great work!


@GaelVaroquaux
Member

🍻

IssamLaradji pushed a commit to IssamLaradji/scikit-learn that referenced this pull request Oct 13, 2014