[MRG + 2] ENH: optimizing power iterations phase for randomized_svd #5141


Merged
1 commit merged into scikit-learn:master on Oct 21, 2015

Conversation

giorgiop
Contributor

I am writing some benchmarks to see whether we can set a default number of power iterations for `utils.extmath.randomized_svd`. The function implements the work of *Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions*, Halko et al., 2009: http://arxiv.org/abs/arXiv:0909.4061

See the code for more details.

Still to do:

  • review unit tests for randomized_svd
  • benchmark whether we can find an absolute number regardless of the input matrix rank, e.g. is 1 always better than 0? In other words, can we find a small integer such that we always improve approximation quality while not making the algorithm significantly slower?
  • extract at least 5 components in every experiment
  • extend randomized_svd with normalized power iterations with QR at each step, and another version with LU
  • sample the random matrix from U[-1,1] instead of from a Gaussian
  • docs

And finally

  • set a default value for n_iter in randomized_svd and a default strategy for normalizing matrices between power iterations
  • review all use of randomized_svd (in TruncatedSVD and RandomizedPCA)

@giorgiop
Contributor Author

  • singular values distance = squared L2 norm of the difference between the singular values from the standard SVD and the approximated ones from randomized_svd
  • Frobenius distance = Frobenius norm of the difference between the original input matrix and the matrix product USV output by randomized_svd (a sketch computing both distances follows this list)
  • noise component = parameter in [0, 1] passed to `make_low_rank_matrix`
  • the size of the dots in the last plots is proportional to the number of power iterations used for that run, growing to the right
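For concreteness, a minimal sketch of how the two distances can be computed; the helper name and the exact `make_low_rank_matrix` arguments below are illustrative, not the benchmark code of this PR:

```python
import numpy as np
from scipy import linalg
from sklearn.datasets import make_low_rank_matrix
from sklearn.utils.extmath import randomized_svd

def benchmark_distances(X, n_comps, n_iter):
    # approximate decomposition under test
    U, s, Vt = randomized_svd(X, n_comps, n_iter=n_iter)
    # reference singular values from the standard (exact) SVD
    s_exact = linalg.svd(X, compute_uv=False)[:n_comps]
    sv_dist = np.sum((s_exact - s) ** 2)              # squared L2 norm
    frob_dist = linalg.norm(X - (U * s) @ Vt, 'fro')  # ||X - U S V^T||_F
    return sv_dist, frob_dist

# tail_strength plays the role of the "noise component" parameter above
X = make_low_rank_matrix(n_samples=500, n_features=500, effective_rank=10,
                         tail_strength=0.1, random_state=0)
print(benchmark_distances(X, n_comps=10, n_iter=2))
```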

[figures 1–10: benchmark plots]

So far, the plots suggest 2 may be the right choice. A larger number may even degrade the quality of the approximation. The flat behaviour of the last two plots comes from very small datasets.

Ping @ogrisel

@ogrisel
Member

ogrisel commented Aug 20, 2015

This looks great, although the plots on diabetes and abalone are not very interesting as they extract a single component. Maybe you should force the extraction of at least 5 components to make them more relevant.

plt.legend(loc="lower right")
plt.axhline(y=0, ls=':', c='black')
plt.suptitle("%s: distances from SVD vs. running time\n \
fraction of compontens used in rand_SVD = %i" %
Member

Typo: components.

Also, from your plots it looks like this is the number of components and not the fraction, right?

Contributor Author

Yes, I always take 5%; what is shown is the resulting absolute number. Thanks.

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch 3 times, most recently from 4be01ff to b83adab Compare August 21, 2015 13:25
@giorgiop
Contributor Author

scipy.sparse.linalg.svds does reduce the running time a little, but the singular values and singular vectors it returns are sorted differently from those of randomized_svd and scipy.linalg.svd. I would stick with the original; there is not much gain here anyway.
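For reference, a small sketch of the re-ordering that would be needed; this is just an illustration of the mismatch, not code from this PR:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.randn(200, 100)

U, s, Vt = svds(X, k=5)
# svds does not return the triplets in the descending order used by
# scipy.linalg.svd and randomized_svd, so re-sort before comparing.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order]
```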

I have coded one last test. Looking at the 2 new plots, I think we can be confident that 2 is a good default value. The input shape was (500, 5000). `rank` is the rank of the input matrix; `n_comps` is the parameter of randomized_svd.

[figures 19 and 20]

@giorgiop
Contributor Author

For completeness, here are the other plots, updated. (There is no change on the small datasets, even when computing 5 components.)

[figures 1–10: updated benchmark plots]

@ogrisel
Member

ogrisel commented Aug 24, 2015

This is great. Thanks for this empirical study. +1 for using 2 power iterations in randomized_svd by default.

@ogrisel
Member

ogrisel commented Aug 24, 2015

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

  • on the plus side: this would give a non-negligible speed up with similar accuracy on most datasets,
  • on the negative side: it can cause a slight change of behavior that users upgrading from a previous version of scikit-learn might not expect.

Any opinions from others, e.g. @larsmans?
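If the default does change, users can always pin the old behaviour explicitly; a rough sketch, assuming 3 was the previous value as stated above:

```python
from sklearn.decomposition import RandomizedPCA, TruncatedSVD

# keep the previous number of power iterations explicitly
tsvd = TruncatedSVD(n_components=50, n_iter=3)
rpca = RandomizedPCA(n_components=50, iterated_power=3)
```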

@amueller
Member

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

@mblondel
Member

Another thing which would be nice to investigate, if you have time: our implementation of `range_finder` uses Algorithm 4.3, but according to the paper the procedure is numerically unstable. The paper suggests a subspace iteration method in Algorithm 4.4. It would be interesting to compare both algorithms in terms of accuracy and computational time.
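For reference, a minimal sketch of what Algorithm 4.4 (randomized subspace iteration) amounts to; the function name and signature here are illustrative, not the existing `randomized_range_finder` API:

```python
import numpy as np
from scipy import linalg

def subspace_iteration_range_finder(A, size, n_iter, random_state=np.random):
    """Orthonormal basis Q approximating the range of A (Halko et al., Alg. 4.4)."""
    Q = random_state.normal(size=(A.shape[1], size))
    Q, _ = linalg.qr(A @ Q, mode='economic')
    for _ in range(n_iter):
        # re-orthonormalize after every product so that rounding errors
        # do not wash out the small singular modes
        Q, _ = linalg.qr(A.T @ Q, mode='economic')
        Q, _ = linalg.qr(A @ Q, mode='economic')
    return Q
```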

@giorgiop
Contributor Author

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

It does. And it also uses 2 power iterations by default.

I ran a few tests. Sklearn seems slower. Performance is similar, but sklearn is slightly better here. A difference is that fbpca's performance does not deteriorate when the number of power iterations is "too" large. (I have not looked into the code.)

Code on this gist. Graphs attached.

[figures 1 and 2: sklearn vs. fbpca comparison]

@giorgiop
Contributor Author

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

What do people think about this? @ogrisel @larsmans I am happy to have a look into it.

@ogrisel
Member

ogrisel commented Aug 31, 2015

Sklearn seems slower. Performance is similar, but sklearn is slightly better here.

I don't understand this, could you please rephrase?

@ogrisel
Member

ogrisel commented Aug 31, 2015

Also could you please annotate the scatter plot with the number of power iterations for each dot?

See for instance: http://stackoverflow.com/questions/5147112/matplotlib-how-to-put-individual-tags-for-a-scatter-plot (no need for arrows)

@giorgiop
Contributor Author

I was just commenting on the two plots. Sklearn runs slower than fbpca, but its performance in terms of matrix factorization error is slightly better, until sklearn starts to degrade on lfw_people.

Also could you please annotate the scatter plot with the number of power iterations for each dot?

Sure.

@ogrisel
Member

ogrisel commented Aug 31, 2015

The difference is how the power iterations are computed, see:

https://github.com/facebook/fbpca/blob/master/fbpca.py#L1562
vs
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L226

fbpca does an LU factorization, `lu(Q, permute_l=True)`, before each power iteration.

It would be worth checking the paper: maybe our code does not do it because of an oversight on our part. It seems to make the power iteration more stable: too many power iterations do not seem to cause a degradation in the quality of the approximation when the LU steps are there.
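A hedged sketch of what that LU normalization amounts to (the idea, not the fbpca code itself):

```python
import numpy as np
from scipy import linalg

def lu_normalized_range_finder(A, size, n_iter, random_state=np.random):
    # Same structure as the plain power iteration, but a cheap LU
    # (permute_l=True returns the product P @ L) keeps Q well conditioned
    # after each product with A / A.T; a single QR at the end.
    Q = A @ random_state.normal(size=(A.shape[1], size))
    for _ in range(n_iter):
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)
        Q, _ = linalg.lu(A @ Q, permute_l=True)
    Q, _ = linalg.qr(Q, mode='economic')
    return Q
```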

@ogrisel
Member

ogrisel commented Aug 31, 2015

Maybe @ajtulloch would be interested in following this discussion :)

@giorgiop
Contributor Author

Plots updated:
[figures 1 and 2: updated fbpca comparison plots]

@kastnerkyle
Member

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?


@amueller
Member

and should we also switch modes internally?

@ogrisel
Member

ogrisel commented Sep 1, 2015

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

This is only the case when n_components is very close to min(n_samples, n_features). We could implement that strategy in sklearn as well (as a safety belt), but in practice I don't think it matters, since on large enough data you can only do PCA with n_components << min(n_samples, n_features) (otherwise the SVD is too expensive).
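A rough sketch of such a safety belt (hypothetical helper; the 4/5 threshold is the one fbpca reportedly uses, mentioned below):

```python
from scipy import linalg
from sklearn.utils.extmath import randomized_svd

def svd_with_safety_belt(X, n_components, n_iter=2):
    # When n_components is a large fraction of min(n_samples, n_features),
    # a plain truncated LAPACK SVD is cheap enough and exact, so skip the
    # randomized solver entirely.
    if n_components > 0.8 * min(X.shape):
        U, s, Vt = linalg.svd(X, full_matrices=False)
        return U[:, :n_components], s[:n_components], Vt[:n_components]
    return randomized_svd(X, n_components, n_iter=n_iter)
```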

@giorgiop
Contributor Author

giorgiop commented Sep 1, 2015

Looking into the code, the 3 differences I see are:

  • mode switching when n_comp > 4/5 min(n_samples, n_features).

Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

I am comparing with the external interface `pca`, but I do not think there is any difference except for going through the n_comp > 4/5 min(n_samples, n_features) branch, and that was always false in what I ran.

  • an LU factorization between every multiplication by A and A^T in the power iterations. This may be the reason why sklearn runs slightly faster. And this is very likely the reason for the increasing error after many power iterations, replying to @mblondel. From Halko et al.:

Unfortunately, when Algorithm 4.3 is executed in floating point arithmetic, rounding errors will extinguish all information pertaining to singular modes associated with singular values that are small compared with ‖A‖. (Roughly, if machine precision is μ, then all information associated with singular values smaller than μ^(1/(2q+1)) ‖A‖ is lost.)

The paper suggests QR factorizations anyway.

  • fbpca does not special-case m > n and m < n. I guess the support for complex input makes that less straightforward than in sklearn, which takes the transpose (.T) at the beginning.

@ogrisel
Member

ogrisel commented Sep 1, 2015

This may be the reason why sklearn runs slightly faster.

Actually on LFW, sklearn is slightly slower, not faster. Maybe the final QR is less expensive when the LU is done in the intermediate steps? It would be great to confirm with an experiment by adding an option to add the LU steps in our codebase.

And this is very likely the reason of the increasing error after many power iterations.

This looks very likely indeed.

@ogrisel
Member

ogrisel commented Sep 1, 2015

The implementation in dask.linalg by @marianotepper does vanilla power iterations (no LU):

https://github.com/blaze/dask/blob/master/dask/array/linalg.py#L225

@mblondel
Member

mblondel commented Sep 1, 2015

It would be interesting to compare QR and LU decompositions.

Another thing I am curious about is whether it would be worth using a stopping criterion for the loop in the range finder. Halko et al. suggest ||A - QQ^T A||. This can however be expensive to compute.
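For what it's worth, a minimal sketch of evaluating that criterion exactly, which also shows why it is expensive (two extra products with A per check):

```python
import numpy as np

def range_finder_error(A, Q):
    # ||A - Q Q^T A||, i.e. how much of A is missed by the current basis Q
    # (Frobenius norm here for simplicity).
    return np.linalg.norm(A - Q @ (Q.T @ A), ord='fro')
```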

@ajtulloch
Contributor

Cool, great discussion! LMK if we (Mark Tygert and I, who wrote fbpca) can contribute in any way. There's a fairly detailed evaluation at http://tygert.com/implement.pdf if that's useful as well.

@giorgiop
Contributor Author

giorgiop commented Sep 2, 2015

Thanks for the very useful reference @ajtulloch! I am going to make a benchmark file for playing with different versions of the power iterations. I will keep comparing with your fbpca.

@giorgiop
Contributor Author

whats_new.rst updated.

@amueller
Member

lgtm.

@kastnerkyle
Member

This lgtm as well

@glouppe
Contributor

glouppe commented Oct 20, 2015

The what's new entries should be put under 0.18.

@giorgiop
Contributor Author

The what's new entries should be put under 0.18.

Sure. I was hoping for a merge in 0.17 ;)
I will update. I am also working on speeding up the benchmarks with the computation of approximate spectral norms as discussed above.

Also, we should have a common default policy for the power iterations (n_iter and n_oversampling) with the other open PCA PR #5299, shouldn't we?

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch from 7a9c3bc to e4fb7e9 Compare October 20, 2015 13:53
@giorgiop
Contributor Author

whats_new updated, along with some other minor fixes.

I did not implement the faster spectral norm estimate. That measure depends on the random initialization, which makes benchmarking trickier (it would need many runs and averaged performance, etc.).

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch 5 times, most recently from 0a12481 to ce138f7 Compare October 21, 2015 09:21
@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch from ce138f7 to b18f295 Compare October 21, 2015 09:21
@giorgiop
Contributor Author

Done here.

@kastnerkyle kastnerkyle changed the title [MRG + 1] ENH: optimizing power iterations phase for randomized_svd [MRG + 2] ENH: optimizing power iterations phase for randomized_svd Oct 21, 2015
@ogrisel
Member

ogrisel commented Oct 21, 2015

+1 for merging this now and re-adjusting the default params for consistency in the PCA collapse PR if needed.

ogrisel added a commit that referenced this pull request Oct 21, 2015
[MRG + 2] ENH: optimizing power iterations phase for randomized_svd
@ogrisel ogrisel merged commit 0cb93b0 into scikit-learn:master Oct 21, 2015
@ogrisel
Member

ogrisel commented Oct 21, 2015

🍋cello!

@kastnerkyle
Member

🍻

@amueller
Member

:craft beer:

In practice this is often enough for obtaining a good approximation of the
true eigenvalues/vectors in the presence of noise. By `Giorgio Patrini`_.

- :func:`randomized_range_finder` is more numerically stable when many
Member

Can you please add a link to the PR to whatsnew? And versionadded hints for the changed parameters would be nice ;)

Member

You added this in the wrong diff @amueller

Member

OK, now I get it @amueller. Just give me time :)
