
[MRG+2] Incremental PCA #3285


Merged 2 commits on Sep 23, 2014

Conversation

kastnerkyle
Member

An implementation of Incremental PCA. See "Incremental Learning for Visual Tracking" (J. Lim, D. Ross, R. Lin, M. Yang, 2004) for more details.

The key idea behind this PR is to have an out-of-core PCA that gives roughly equivalent results while using constant memory (on the order of the batch_size rather than n_samples).
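The constant-memory pattern described above can be sketched as follows; the data, sizes, and chunking loop here are illustrative, not taken from the PR itself:

```python
# Sketch: fit in fixed-size batches so memory stays on the order of the
# batch size rather than n_samples.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)  # stand-in for data that would be streamed from disk

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):  # each call sees only ~100 rows
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (1000, 5)
```

In a real out-of-core setting the batches would come from a generator reading chunks off disk rather than from an in-memory array.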

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9f3d149 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

>>> ipca = IncrementalPCA(n_components=2)
>>> ipca.fit(X)
IncrementalPCA(batch_size=10, copy=True, n_components=2)
>>> print(np.linalg.norm(ipca.components_, axis=0)) # doctest: +ELLIPSIS
Contributor


Older versions of numpy don't understand the axis argument in np.linalg.norm; one of the Travis builds is failing because of this.
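A version-independent workaround is to compute the column norms by hand (a minimal sketch; `column_norms` is a hypothetical helper, not part of this PR):

```python
import numpy as np

def column_norms(a):
    # Equivalent to np.linalg.norm(a, axis=0), which requires numpy >= 1.8
    return np.sqrt(np.sum(a ** 2, axis=0))

a = np.array([[3.0, 0.0],
              [4.0, 1.0]])
print(column_norms(a))  # [5. 1.]
```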

@eickenberg
Contributor

This looks really nice!

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 5c5197f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling adc28e7 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plots: error vs. batch_size at n_components = 45, 180, 315, 450; error vs. n_components at batch_size = 1000; runtime vs. batch_size at n_components = 45, 180, 315, 450; runtime vs. n_components at batch_size = 1000]

These plots show IncrementalPCA in comparison to PCA and RandomizedPCA. Note that for the error vs. batch_size tests, RandomizedPCA is not shown because it is significantly worse in all tested cases. This makes sense: random projections should, in general, be worse than projections estimated from the data. The PCA lines in the error and time vs. batch_size plots are constant because PCA has no batch_size, but they make a good reference to compare against.

Note that the "jagged plot" has a y-axis scale of 1e-8; I am pretty sure this is just small numerical differences. The average is consistent with PCA across batch sizes, and the variation is ±5e-8, which seems OK to me.

@kastnerkyle kastnerkyle changed the title [WIP] Incremental PCA [MRG] Incremental PCA Jun 26, 2014
@kastnerkyle
Member Author

Additionally, it would be very nice to add whitening and explained variance to this estimator. I have a few leads on how to do this, but I cannot find any academic papers about incremental estimation of these parameters. If I can get explained variance, then whitening follows naturally. If anyone has any ideas I am all ears.

An added advantage is that if these parameters can be estimated, then PCA and IncrementalPCA could be merged since batch_size=n_samples processing is exactly the same as a "regular" PCA. This may or may not be a good option, but it is something to think about.

@eickenberg
Contributor

Explained variance is nothing other than rescaled squared singular values.
You have access to those, right?
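A minimal sketch of that relation on centered data, assuming the S ** 2 / n_samples scaling convention:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
Xc = X - X.mean(axis=0)  # PCA operates on centered data

U, S, V = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S ** 2 / X.shape[0]
explained_variance_ratio = explained_variance / explained_variance.sum()

# The rescaled squared singular values account for all of the variance
print(np.allclose(explained_variance.sum(), Xc.var(axis=0).sum()))  # True
```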


@kastnerkyle
Member Author

I have access to the first ones separately; everything after that is kind of "joint", in that you only get U, S, V = svd(vstack((old_components, X, mean_adjust))) instead of just U, S, V = svd(X) as in the first pass through.

You could do 2 SVDs for every pass through partial_fit, but that seems pretty wasteful just for explained variance, and I feel there must be a way to compute it directly (or at least reverse it out somehow, since we know the first values and a combined version of all the ones after)!

Maybe using the old S (S for the next to last batch) in conjunction with the final S somehow?

Another approach would be to take the ideas from this StackExchange link and try to use them somehow...
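The joint-SVD update described above can be sketched roughly as follows (after the Ross et al. formulation; the variable names and the exact form of the mean-correction row are an illustration, not the PR's actual code). With no truncation it reproduces the exact singular values of the full centered data:

```python
import numpy as np

def partial_update(components, singular_values, mean, n_seen, X_new):
    """One incremental step: SVD of the stacked old basis (scaled by its
    singular values), the centered new batch, and a mean-correction row."""
    k = components.shape[0]
    m = X_new.shape[0]
    batch_mean = X_new.mean(axis=0)
    # Row accounting for the shift between the old mean and the batch mean
    mean_adjust = np.sqrt(n_seen * m / (n_seen + m)) * (mean - batch_mean)
    stacked = np.vstack((singular_values[:, None] * components,
                         X_new - batch_mean,
                         mean_adjust))
    U, S, V = np.linalg.svd(stacked, full_matrices=False)
    new_mean = (n_seen * mean + X_new.sum(axis=0)) / (n_seen + m)
    return V[:k], S[:k], new_mean, n_seen + m

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X1, X2 = X[:50], X[50:]

# Initialize from the first batch, then update with the second
U, S, V = np.linalg.svd(X1 - X1.mean(axis=0), full_matrices=False)
comp, sv, mean, n = V, S, X1.mean(axis=0), 50
comp, sv, mean, n = partial_update(comp, sv, mean, n, X2)

# With k = n_features (no truncation) the update is exact
S_full = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.allclose(sv, S_full))  # True
```

When k < rank, the truncation at each step makes the result approximate, which is exactly the batch_size/n_components trade-off shown in the plots.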

@eickenberg
Contributor

OK, so the S you have access to does not correspond to the global S of the data seen until then. I would have hoped that by "chance" they do, and that the differences of the matrix you are working with wrt the full matrix are again "by chance" accounted for by a change in U (since V is definitely the same, right?).

I do not know of a formula using old_S and S, but will keep the problem in my head :)


@agramfort
Member

Total variance is np.sum(X * X), which you can calculate by accumulation. Explained variance will be

(np.sum(eigvals) - np.sum(X * X)) / np.sum(X * X)

or something like this.

Clear?
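The accumulation idea can be sketched as below (pre-centered data for simplicity; the batching is illustrative, and the final ratio formula follows the sum-of-squares convention rather than the exact expression above):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(300, 4)
Xc = X - X.mean(axis=0)

total_sq = 0.0
for batch in np.array_split(Xc, 6):
    total_sq += np.sum(batch * batch)  # running sum, never holds all of X

# The sum of squared singular values equals the accumulated total,
# so explained-variance ratios can be formed per component.
S = np.linalg.svd(Xc, compute_uv=False)
print(np.isclose(np.sum(S ** 2), total_sq))  # True
```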

@kastnerkyle
Member Author

The total variance calculation is clear, but the eigvals part is not: the eigvals that fall out of the current calculation come from svd(vstack((previous_components, X, mean_adjust))) rather than from just X. I am running a test now to see if the eigvals at the end of several partial_fit calls are approximately equivalent to the eigvals from PCA, but I don't know yet. If so, I think the rule from the PCA block would work directly: calculate (S ** 2) / n_samples_seen, or something similar. If not...

@kastnerkyle
Member Author

Initial investigations indicate that S ** 2 is still good for calculating explained_variance_ and explained_variance_ratio_ . Working on a commit now with calculations and unit tests, and will post some plots when that is done. This should mean that whitening can be added as well.

I guess now is a good time to start the discussion about inverse_transform whitening: I think it is preferable to invert the whitening as well, even though PCA doesn't currently do this. Because there is no prior usage of the IncrementalPCA module, we would be starting off "correctly" IMO, without the baggage/possible deprecation that may have to happen in PCA. See the discussion at #3107 for more details.

@eickenberg
Contributor

Nice! For incremental SVD I am almost sure I have a proof now for S always being the correct S, just like V is always the correct V (for a full decomposition); at the very least I may have found a way to show that they are close... For incremental PCA, one would additionally need to fold the mean-adjustment considerations into this, just like in the cited paper. It could be straightforward, but I haven't looked at it.

@kastnerkyle
Member Author

[Plots: explained_variance_ and explained_variance_ratio_ comparison vs. batch size]

Comparison of IncrementalPCA vs. PCA explained_variance_ and explained_variance_ratio_. I am still skeptical of the explained_variance_ratio_; I am not sure whether I should track the overall sum, as is currently happening, or only track the sum through [:n_components]...

You lose information due to cutting components at each partial_fit, but just like in the other plots, the closer your batch_size gets to n_samples and the more components are kept, the closer the estimate gets to the PCA result.

If this seems reasonable, I will work on adding whitening (as well as inverse transform unwhitening) and some tests for these extras (probably just comparing to the PCA result to make sure they are close).

Generating script:

import numpy as np
from sklearn.decomposition import IncrementalPCA, PCA
from sklearn.datasets import make_low_rank_matrix
import matplotlib.pyplot as plt


def run_experiment(X, n_components_tests, batch_size_tests, ratio=False):
    all_err = []
    for n in n_components_tests:
        print("n_components %i" % n)
        pca = PCA(n_components=n)
        pca.fit(X)
        err = []
        for i in batch_size_tests:
            ipca = IncrementalPCA(n_components=n, batch_size=i)
            ipca.fit(X)
            if not ratio:
                print("Explained variance error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ -
                                     pca.explained_variance_))
            else:
                print("Explained variance ratio error, batch_size %i" % i)
                mae = np.mean(np.abs(ipca.explained_variance_ratio_ -
                                     pca.explained_variance_ratio_))
            err.append(mae)
        all_err.append(err)
    return np.array(all_err)

X = make_low_rank_matrix(n_samples=10000, random_state=1999)
n_components_tests = np.array([2, 5, 10, 25, 50, 75, 100])
batch_size_tests = np.array([1000, 2500, 5000, 7500, 10000])
all_var_err = run_experiment(X, n_components_tests, batch_size_tests)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_var_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.figure()
all_ratio_err = run_experiment(X, n_components_tests, batch_size_tests,
                               ratio=True)
for n, i in enumerate(n_components_tests):
    plt.plot(batch_size_tests, all_ratio_err[n, :], label="n_components=%i" % i)
plt.xlabel("Batch size")
plt.ylabel("Explained variance ratio")
plt.title("Mean absolute error between IncrementalPCA and PCA")
plt.legend(loc="upper right")
plt.show()

@kastnerkyle kastnerkyle changed the title [MRG] Incremental PCA [WIP] Incremental PCA Jun 27, 2014
@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 65cd21f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plot: explained_variance_ratio_ error vs. batch size]

Updated the explained variance ratio calculation. It seems closer to correct (errors are close to 0 for n_components=n_features, and otherwise similar to the previous calculation), but the "bump" in the chart leads me to believe that some kind of compensation has to happen when batches have different sizes. Around 7500, where the bump is worst, there are 2 batches: one of 5000 and one of 2500. Ideally, this plot should converge toward 0 as batch size increases.

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling b2af89f on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plot: explained_variance_ratio_ error vs. batch size]

After weighting by the ratio of the batch size to the total number of samples seen so far, this estimate looks closer to what I would expect. However, after talking with @ogrisel I am going to try an online variance estimate (such as the one discussed here) and see if that gives a more accurate estimate of the ratio; 0.05 out of a total of 1.0 is still 5% error...

@coveralls

Coverage Status

Coverage increased (+0.07%) when pulling 5d9b4c1 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle
Member Author

[Plots: explained_variance_ and explained_variance_ratio_ error vs. batch size]

After tracking the explained_variance_ sum incrementally (using a batch Youngs and Cramer method, found in this paper), things look much better. Now all that is left is to add whitening and noise variance estimates, as well as tests. Special thanks to @eickenberg and @ogrisel for the discussions about this.
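A batch mean/variance merge in the spirit of the Youngs-and-Cramer / Chan et al. update can be sketched as below; the function name and signature are illustrative, not the PR's actual code:

```python
import numpy as np

def incremental_mean_var(mean, var, n, X_new):
    """Merge running columnwise (mean, var, count) with a new batch."""
    m = X_new.shape[0]
    batch_mean = X_new.mean(axis=0)
    batch_var = X_new.var(axis=0)
    total = n + m
    delta = batch_mean - mean
    merged_mean = mean + delta * m / total
    # Combine sums of squared deviations, plus a cross term for the mean shift
    ssd = var * n + batch_var * m + delta ** 2 * n * m / total
    return merged_mean, ssd / total, total

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
mean, var, n = X[:100].mean(axis=0), X[:100].var(axis=0), 100
for batch in np.array_split(X[100:], 4):
    mean, var, n = incremental_mean_var(mean, var, n, batch)

# Matches the single-pass statistics over all of X
print(np.allclose(mean, X.mean(axis=0)), np.allclose(var, X.var(axis=0)))
```

The cross term is what the naive per-batch weighting misses when batch sizes differ, which is consistent with the "bump" observed above.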

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling e0cb293 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling 500a298 on kastnerkyle:incremental_pca into 9d3b48d on scikit-learn:master.

@kastnerkyle kastnerkyle changed the title [WIP] Incremental PCA [MRG] Incremental PCA Jul 2, 2014
@kastnerkyle
Member Author

I think this is ready for another review/look by other people

@kastnerkyle kastnerkyle changed the title [MRG] Incremental PCA [WIP] Incremental PCA Jul 4, 2014
@kastnerkyle
Member Author

@ogrisel and I talked some about factoring some common methods between PCA, RandomizedPCA, KernelPCA, and others (get_precision, get_covariance, transform, inverse_transform) into a BasePCA class, but for now I just re-used the calculations from PCA. In the long term it might be something to think about and discuss.

@ogrisel
Member

ogrisel commented Sep 23, 2014

@arjoly, @GaelVaroquaux any final review? Maybe @larsmans would be interested in having a look at this one as well.

@kastnerkyle
Member Author

I updated the what's new - your words summed it up pretty well :)

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 5b5dd79 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

replacement for principal component analysis (PCA) when the dataset to be
decomposed is too large to fit in memory. IPCA builds a low-rank approximation
for the input data using an amount of memory which is independent of the
input data size.
Member


Independent of the number of samples, I suppose, not the number of features? (In document processing, the number of features tends to grow as O(√n) in the number of samples, with a big constant.)

Member Author


Right. The number of features has to stay constant over all batches, and the featurewise cost is pretty much SVD dependent... I am looking at a future PR to add the ability to use randomized SVD in the case where the desired n_components <<< n_features, which is pretty much exactly the case for documents IIRC! Basically a solver='random' parameter that changes the central SVD. @ogrisel wanted this and it is a really good idea :)
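A rough sketch of what such a randomized path could use under the hood, via sklearn.utils.extmath.randomized_svd; the wiring of a solver parameter into the estimator is hypothetical:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
X = rng.randn(200, 1000)  # many features, few desired components
Xc = X - X.mean(axis=0)

# Only the top 10 components are computed, so the cost scales with
# n_components rather than n_features.
U, S, V = randomized_svd(Xc, n_components=10, random_state=0)
print(U.shape, S.shape, V.shape)  # (200, 10) (10,) (10, 1000)
```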

@larsmans
Member

I'm late to the party and I don't know the algorithm, but this seems like a useful addition and (apart from my comments) a well-written, readable estimator class.

It looks like one could even do a sparse-matrix PCA with this: densify in batches, then feed those to the estimator (don't worry Gaël, I'm not suggesting to add that to the class :).
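The densify-in-batches idea can be sketched as below: keep the matrix sparse and densify only one slice at a time before partial_fit. This is purely illustrative, not part of this PR:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import IncrementalPCA

X_sparse = sp.random(1000, 50, density=0.05, format="csr", random_state=0)

ipca = IncrementalPCA(n_components=5)
batch_size = 200
for start in range(0, X_sparse.shape[0], batch_size):
    batch = X_sparse[start:start + batch_size].toarray()  # densify one batch
    ipca.partial_fit(batch)

print(ipca.components_.shape)  # (5, 50)
```

Peak memory is then driven by the dense batch, not by a dense copy of the whole matrix.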

@kastnerkyle
Member Author

I never really thought about densifying sparse data, but I like the idea... I will have to play with that. Maybe adding the SVD solver to the parameters will open up some space to play with this? I assume randomized SVD would be much better for sparse matrices.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling fd0c768 on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

@larsmans
Member

Randomized SVD is great for sparse data, but after mean-centering the sparsity is destroyed. A different solver could do the mean-centering on the fly instead of as a preprocessing step to avoid densifying. That could even help in the dense case, as there's no longer a need to first compute an X - mu matrix.
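On-the-fly centering can be sketched with a LinearOperator: since (X - 1·muᵀ)v = Xv - (mu·v)·1, an iterative solver never needs the dense centered matrix. This is a sketch of the idea, not part of this PR:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(300, 40, density=0.1, format="csr", random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()
ones = np.ones(X.shape[0])

# Implicitly centered operator: sparsity of X is never destroyed
centered = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X.dot(np.ravel(v)) - ones * np.dot(mu, np.ravel(v)),
    rmatvec=lambda v: X.T.dot(np.ravel(v)) - mu * np.dot(ones, np.ravel(v)),
    dtype=np.float64,
)
U, S, Vt = svds(centered, k=5)

# Matches the SVD of the explicitly centered dense matrix
S_dense = np.linalg.svd(X.toarray() - mu, compute_uv=False)
print(np.allclose(np.sort(S), np.sort(S_dense)[-5:]))  # True
```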

@ogrisel
Member

ogrisel commented Sep 23, 2014

Indeed @larsmans we could do real PCA with sparse data by centering only small mini-batches on the fly.

I would vote to implement that in a separate PR though.

@larsmans
Member

Oh, I'm just mentioning it, not requesting that it should be written now.

@kastnerkyle
Member Author

I agree on the new PR side, but I definitely think it is a good idea! Just updated the double backticks.

@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 5f8271f on kastnerkyle:incremental_pca into fd4ba4d on scikit-learn:master.

@GaelVaroquaux
Member

It looks like one could even do a sparse-matrix PCA with this: densify
in batches, then feed those to the estimator (don't worry Gaël, I'm not
suggesting to add that to the class :).

Well, in a later PR, I would see value in that. In this PR, I am afraid
of delaying the merge of an already very useful feature.

@larsmans
Member

Sure, sure. +1 for merge.

@larsmans larsmans changed the title [MRG+1] Incremental PCA [MRG+2] Incremental PCA Sep 23, 2014
ogrisel added a commit that referenced this pull request Sep 23, 2014
@ogrisel ogrisel merged commit 7426e6a into scikit-learn:master Sep 23, 2014
@ogrisel
Member

ogrisel commented Sep 23, 2014

Thanks @kastnerkyle!

@arjoly
Member

arjoly commented Sep 23, 2014

Thanks @kastnerkyle !!! Congratulations on your high-quality code.

@kastnerkyle
Member Author

Thanks everyone for all the review and improvements :) Glad to see this merged!

@agramfort
Member

🍻

@eickenberg
Contributor

cool! thanks for the great work!


@GaelVaroquaux
Member

🍻

IssamLaradji pushed a commit to IssamLaradji/scikit-learn that referenced this pull request Oct 13, 2014