[MRG + 2] ENH: optimizing power iterations phase for randomized_svd #5141


Merged
1 commit merged into scikit-learn:master on Oct 21, 2015

Conversation

giorgiop
Contributor

I am writing some benchmarks to see whether we can set a default number of power iterations for `utils.extmath.randomized_svd`. The function implements the work of *Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions*, Halko et al., 2009: http://arxiv.org/abs/arXiv:0909.4061

See the code for more details.

Still to do:

  • review unit tests for randomized_svd
  • benchmark whether we can find an absolute number regardless of the input matrix rank, e.g. is 1 always better than 0? In other words, can we find a small integer such that we always improve approximation quality while not making the algorithm significantly slower?
  • extract at least 5 components in every experiment
  • extend randomized_svd with normalized power iterations with QR at each step, and another version with LU
  • sample the random matrix from U[-1,1] instead of from a Gaussian
  • docs

And finally

  • set a default value for n_iter in randomized_svd and a default strategy for normalizing matrices between power iterations
  • review all use of randomized_svd (in TruncatedSVD and RandomizedPCA)

@giorgiop
Contributor Author

  • singular values distance = squared L2 norm of the difference between the singular values from the standard SVD and the approximated ones from randomized_svd
  • Frobenius distance = Frobenius norm of the difference between the original input matrix and the matrix product USV output by randomized_svd (a sketch computing both distances follows this list)
  • noise component = parameter in [0, 1] passed to `make_low_rank_matrix`
  • the size of the dots in the last plots is proportional to the number of power iterations used for that run, growing to the right
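For concreteness, a minimal sketch of how the two distances can be computed; the helper name and the exact `make_low_rank_matrix` arguments below are illustrative, not the benchmark code of this PR:

```python
import numpy as np
from scipy import linalg
from sklearn.datasets import make_low_rank_matrix
from sklearn.utils.extmath import randomized_svd

def benchmark_distances(X, n_comps, n_iter):
    # approximate decomposition under test
    U, s, Vt = randomized_svd(X, n_comps, n_iter=n_iter)
    # reference singular values from the standard (exact) SVD
    s_exact = linalg.svd(X, compute_uv=False)[:n_comps]
    sv_dist = np.sum((s_exact - s) ** 2)              # squared L2 norm
    frob_dist = linalg.norm(X - (U * s) @ Vt, 'fro')  # ||X - U S V^T||_F
    return sv_dist, frob_dist

# tail_strength plays the role of the "noise component" parameter above
X = make_low_rank_matrix(n_samples=500, n_features=500, effective_rank=10,
                         tail_strength=0.1, random_state=0)
print(benchmark_distances(X, n_comps=10, n_iter=2))
```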

[figures 1–10: benchmark plots]

So far, the plots suggest 2 may be the right choice. A larger number may even degrade the quality of the approximation. The flat behaviour of the last two plots comes from very small datasets.

Ping @ogrisel

@ogrisel
Member

ogrisel commented Aug 20, 2015

This looks great, although the plots on diabetes and abalone are not very interesting as they extract a single component. Maybe you should force the extraction of at least 5 components to make them more relevant.

plt.legend(loc="lower right")
plt.axhline(y=0, ls=':', c='black')
plt.suptitle("%s: distances from SVD vs. running time\n \
fraction of compontens used in rand_SVD = %i" %
Member

Typo: components.

Also, from your plots it looks like this is the number of components and not the fraction, right?

Contributor Author

Yes, I always take 5%; what is shown is the resulting absolute number. Thanks.

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch 3 times, most recently from 4be01ff to b83adab Compare August 21, 2015 13:25
@giorgiop
Contributor Author

scipy.sparse.linalg.svds does reduce the running time a little, but the singular values and singular vectors it returns are sorted differently from those of randomized_svd and scipy.linalg.svd. I would stick with the original; there is not much gain here anyway.
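For reference, a small sketch of the re-ordering that would be needed; this is just an illustration of the mismatch, not code from this PR:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)
X = rng.randn(200, 100)

U, s, Vt = svds(X, k=5)
# svds does not return the triplets in the descending order used by
# scipy.linalg.svd and randomized_svd, so re-sort before comparing.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order]
```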

I have coded one last test. Looking at the 2 new plots, I think we can be confident that 2 is a good default value. The input shape was (500, 5000). `rank` is the rank of the input matrix; `n_comps` is the parameter of randomized_svd.

[figures 19 and 20]

@giorgiop
Contributor Author

For completeness, here are the other plots, updated. (There is no change on the small datasets, even when computing 5 components.)

[figures 1–10: updated benchmark plots]

@ogrisel
Member

ogrisel commented Aug 24, 2015

This is great. Thanks for this empirical study. +1 for using 2 power iterations in randomized_svd by default.

@ogrisel
Member

ogrisel commented Aug 24, 2015

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

  • on the plus side: this would give a non-negligible speed up with similar accuracy on most datasets,
  • on the negative side: it can cause a slight change of behavior that users upgrading from a previous version of scikit-learn might not expect.

Any opinions from others, e.g. @larsmans?
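If the default does change, users can always pin the old behaviour explicitly; a rough sketch, assuming 3 was the previous value as stated above:

```python
from sklearn.decomposition import RandomizedPCA, TruncatedSVD

# keep the previous number of power iterations explicitly
tsvd = TruncatedSVD(n_components=50, n_iter=3)
rpca = RandomizedPCA(n_components=50, iterated_power=3)
```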

@amueller
Member

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

@mblondel
Member

Another thing which would be nice to investigate, if you have time: our implementation of `range_finder` uses Algorithm 4.3, but according to the paper the procedure is numerically unstable. The paper suggests a subspace iteration method in Algorithm 4.4. It would be interesting to compare both algorithms in terms of accuracy and computational time.
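For reference, a minimal sketch of what Algorithm 4.4 (randomized subspace iteration) amounts to; the function name and signature here are illustrative, not the existing `randomized_range_finder` API:

```python
import numpy as np
from scipy import linalg

def subspace_iteration_range_finder(A, size, n_iter, random_state=np.random):
    """Orthonormal basis Q approximating the range of A (Halko et al., Alg. 4.4)."""
    Q = random_state.normal(size=(A.shape[1], size))
    Q, _ = linalg.qr(A @ Q, mode='economic')
    for _ in range(n_iter):
        # re-orthonormalize after every product so that rounding errors
        # do not wash out the small singular modes
        Q, _ = linalg.qr(A.T @ Q, mode='economic')
        Q, _ = linalg.qr(A @ Q, mode='economic')
    return Q
```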

@giorgiop
Contributor Author

How do we compare against fbpca? https://github.com/facebook/fbpca ? That also just uses power iterations, right?

It does. And it also uses 2 power iterations by default.

I ran a few tests. Sklearn seems slower. Performance is similar, but sklearn is slightly better here. A difference is that fbpca's performance does not deteriorate when the number of power iterations is "too" large. (I have not looked into the code.)

Code on this gist. Graphs attached.

[figures 1 and 2: sklearn vs. fbpca comparison]

@giorgiop
Contributor Author

I wonder if we should change the default value for RandomizedPCA and TruncatedSVD from 3 to 2 as well:

What do people think about this? @ogrisel @larsmans I am happy to have a look into it.

@ogrisel
Member

ogrisel commented Aug 31, 2015

Sklearn seems slower. Performance is similar, but sklearn is slightly better here.

I don't understand this, could you please rephrase?

@ogrisel
Member

ogrisel commented Aug 31, 2015

Also could you please annotate the scatter plot with the number of power iterations for each dot?

See for instance: http://stackoverflow.com/questions/5147112/matplotlib-how-to-put-individual-tags-for-a-scatter-plot (no need for arrows)

@giorgiop
Contributor Author

I was just commenting on the two plots. Sklearn runs slower than fbpca, but its performance in terms of matrix factorization error is slightly better, until sklearn starts to degrade on lfw_people.

Also could you please annotate the scatter plot with the number of power iterations for each dot?

Sure.

@ogrisel
Member

ogrisel commented Aug 31, 2015

The difference is how the power iterations are computed, see:

https://github.com/facebook/fbpca/blob/master/fbpca.py#L1562
vs
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L226

fbpca does an LU factorization, `lu(Q, permute_l=True)`, before each power iteration.

It would be worth checking the paper: maybe our code does not do it because of an oversight on our part. It seems to make the power iteration more stable: too many power iterations do not seem to cause a degradation in the quality of the approximation when the LU steps are there.
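A hedged sketch of what that LU normalization amounts to (the idea, not the fbpca code itself):

```python
import numpy as np
from scipy import linalg

def lu_normalized_range_finder(A, size, n_iter, random_state=np.random):
    # Same structure as the plain power iteration, but a cheap LU
    # (permute_l=True returns the product P @ L) keeps Q well conditioned
    # after each product with A / A.T; a single QR at the end.
    Q = A @ random_state.normal(size=(A.shape[1], size))
    for _ in range(n_iter):
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)
        Q, _ = linalg.lu(A @ Q, permute_l=True)
    Q, _ = linalg.qr(Q, mode='economic')
    return Q
```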

@ogrisel
Member

ogrisel commented Aug 31, 2015

Maybe @ajtulloch would be interested in following this discussion :)

@giorgiop
Contributor Author

Plots updated:
[figures 1 and 2: updated fbpca comparison plots]

@kastnerkyle
Member

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?


@amueller
Member

and should we also switch modes internally?

@ogrisel
Member

ogrisel commented Sep 1, 2015

I think fbpca switches "modes" depending on the number of components - IIRC from looking at the code they switch to doing "full" pca and truncating if the number of components is >X% of the total. Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

This is only the case when n_components is very close to min(n_samples, n_features). We could implement that strategy in sklearn as well (as a safety belt), but in practice I don't think it matters, since on large enough data you can only do PCA with n_components << min(n_samples, n_features) (otherwise the SVD is too expensive).
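A rough sketch of such a safety belt (hypothetical helper; the 4/5 threshold is the one fbpca reportedly uses, mentioned below):

```python
from scipy import linalg
from sklearn.utils.extmath import randomized_svd

def svd_with_safety_belt(X, n_components, n_iter=2):
    # When n_components is a large fraction of min(n_samples, n_features),
    # a plain truncated LAPACK SVD is cheap enough and exact, so skip the
    # randomized solver entirely.
    if n_components > 0.8 * min(X.shape):
        U, s, Vt = linalg.svd(X, full_matrices=False)
        return U[:, :n_components], s[:n_components], Vt[:n_components]
    return randomized_svd(X, n_components, n_iter=n_iter)
```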

@giorgiop
Contributor Author

giorgiop commented Sep 1, 2015

Looking into the code, the 3 differences I see are:

  • mode switching when n_comp > 4/5 min(n_samples, n_features).

Are we comparing only to the randomized solver of fbpca or to the external interface (which may switch modes internally)?

I am comparing with the external interface `pca`, but I do not think there is any difference except for going through the n_comp > 4/5 min(n_samples, n_features) branch, and that was always false in what I ran.

  • an LU factorization between every multiplication by A and A^T in the power iterations. This may be the reason why sklearn runs slightly faster. And this is very likely the reason for the increasing error after many power iterations, replying to @mblondel. From Halko et al.:

Unfortunately, when Algorithm 4.3 is executed in floating point arithmetic, rounding errors will extinguish all information pertaining to singular modes associated with singular values that are small compared with ‖A‖. (Roughly, if machine precision is μ, then all information associated with singular values smaller than μ^(1/(2q+1)) ‖A‖ is lost.)

The paper suggests QR factorizations anyway.

  • fbpca does not special-case m > n and m < n. I guess the support for complex input makes that less straightforward than in sklearn, which takes the transpose (.T) at the beginning.

@ogrisel
Member

ogrisel commented Sep 1, 2015

This may be the reason why sklearn runs slightly faster.

Actually on LFW, sklearn is slightly slower, not faster. Maybe the final QR is less expensive when the LU is done in the intermediate steps? It would be great to confirm with an experiment by adding an option to add the LU steps in our codebase.

And this is very likely the reason of the increasing error after many power iterations.

This looks very likely indeed.

@ogrisel
Member

ogrisel commented Sep 1, 2015

The implementation in dask.linalg by @marianotepper does vanilla power iterations (no LU):

https://github.com/blaze/dask/blob/master/dask/array/linalg.py#L225

@mblondel
Member

mblondel commented Sep 1, 2015

It would be interesting to compare QR and LU decompositions.

Another thing I am curious about is whether it would be worth using a stopping criterion for the loop in the range finder. Halko et al. suggest ||A - QQ^T A||. This can however be expensive to compute.
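For what it's worth, a minimal sketch of evaluating that criterion exactly, which also shows why it is expensive (two extra products with A per check):

```python
import numpy as np

def range_finder_error(A, Q):
    # ||A - Q Q^T A||, i.e. how much of A is missed by the current basis Q
    # (Frobenius norm here for simplicity).
    return np.linalg.norm(A - Q @ (Q.T @ A), ord='fro')
```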

@ajtulloch
Contributor

Cool, great discussion! LMK if we (Mark Tygert and I, who wrote fbpca) can contribute in any way. There's a fairly detailed evaluation at http://tygert.com/implement.pdf if that's useful as well.

@giorgiop
Contributor Author

giorgiop commented Sep 2, 2015

Thanks for the very useful reference @ajtulloch! I am going to make a benchmark file for playing with different versions of the power iterations. I will keep comparing with your fbpca.

@giorgiop
Contributor Author

whats_new.rst updated.

@amueller
Member

lgtm.

@kastnerkyle
Member

This lgtm as well

@glouppe
Contributor

glouppe commented Oct 20, 2015

The what's new entries should be put under 0.18.

@giorgiop
Contributor Author

The what's new entries should be put under 0.18.

Sure. I was hoping for a merge in 0.17 ;)
I will update. I am also working on speeding up the benchmarks with the computation of approximate spectral norms as discussed above.

Also, we should have a common default policy for the power iterations (n_iter and n_oversampling) with the other open PCA PR #5299, shouldn't we?

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch from 7a9c3bc to e4fb7e9 Compare October 20, 2015 13:53
@giorgiop
Contributor Author

whats_new updated, along with some other minor fixes.

I did not implement the faster spectral norm estimate. That measure depends on the random initialization, which makes benchmarking trickier (it would need many runs and averaged performance, etc.).

@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch 5 times, most recently from 0a12481 to ce138f7 Compare October 21, 2015 09:21
@giorgiop giorgiop force-pushed the power-iter-randomized-svd branch from ce138f7 to b18f295 Compare October 21, 2015 09:21
@giorgiop
Contributor Author

Done here.

@kastnerkyle kastnerkyle changed the title [MRG + 1] ENH: optimizing power iterations phase for randomized_svd [MRG + 2] ENH: optimizing power iterations phase for randomized_svd Oct 21, 2015
@ogrisel
Member

ogrisel commented Oct 21, 2015

+1 for merging this now and re-adjusting the default params for consistency in the PCA collapse PR if needed.

ogrisel added a commit that referenced this pull request Oct 21, 2015
[MRG + 2] ENH: optimizing power iterations phase for randomized_svd
@ogrisel ogrisel merged commit 0cb93b0 into scikit-learn:master Oct 21, 2015
@ogrisel
Member

ogrisel commented Oct 21, 2015

🍋cello!

@kastnerkyle
Member

🍻

@amueller
Member

:craft beer:

In practice this is often enough for obtaining a good approximation of the
true eigenvalues/vectors in the presence of noise. By `Giorgio Patrini`_.

- :func:`randomized_range_finder` is more numerically stable when many
Member

Can you please add a link to the PR to whatsnew? And versionadded hints for the changed parameters would be nice ;)

Member

You added this in the wrong diff @amueller

Member

OK, now I get it @amueller. Just give me time :)
