
[MRG] add lobpcg svd_solver to PCA and TruncatedSVD #12319


Closed
wants to merge 119 commits
Commits
4555c09
added lobpcg_svd solver to pca and truncated_svd
lobpcg Oct 7, 2018
ee76629
format editing added local lobpcg temporarily
lobpcg Oct 7, 2018
469fb78
lobpcg.py formatting
lobpcg Oct 8, 2018
4a2c802
formatting lobpcg.py
lobpcg Oct 8, 2018
d429211
errors fixed
lobpcg Oct 8, 2018
9d05e3b
formatting/error fixes
lobpcg Oct 8, 2018
5867469
multiple editing
lobpcg Oct 8, 2018
f8b23c1
"fake" lobpcg edits, auto transpose in extmath, many test changes
lobpcg Oct 9, 2018
0d8b85d
STY: small fixes
rc Oct 9, 2018
2078cbe
FIX: move making operators after system reporting to fix the report
rc Oct 10, 2018
896b38d
ENH: port fixes from scipy/scipy #9352
rc Oct 10, 2018
7166dc6
FIX: preserve input matrix dtypes
rc Oct 10, 2018
b9c5bc9
FIX: typo in test_singular_values()
rc Oct 10, 2018
0a1fb0d
STY: fix long lines
rc Oct 10, 2018
de7d195
STY: add extra lines (PEP8)
rc Oct 10, 2018
6f27f4b
added test_pca_lobpcg_solver to test_pca
lobpcg Oct 10, 2018
ec46500
lobpcg added to test_explained_variance in test_pca
lobpcg Oct 10, 2018
73c5baf
typos fixed
lobpcg Oct 10, 2018
4b3e760
lobpcg added to test_singular_values in test_pca
lobpcg Oct 10, 2018
b12921d
doctest matrix output formatting fixed in lobpcg
lobpcg Oct 10, 2018
732453b
lobpcg.py overwrite residualTolerance =< 0.0
lobpcg Oct 10, 2018
dd53ee3
typo fixed in lobpcg.py
lobpcg Oct 10, 2018
e62c6bc
3 test_lobpcg_pca_* added to test_pca
lobpcg Oct 10, 2018
690fd32
no need to set tol in test_n_components_mle
lobpcg Oct 10, 2018
08ff007
doctest errors fixed in lobpcg.py
lobpcg Oct 10, 2018
7fe4bfa
FIX: make save(), as2d() private
rc Oct 11, 2018
45f3ec0
DOC: merge Other Parameters with Parameters, document all return values
rc Oct 11, 2018
b449d78
FIX: correct tol in _report_nonhermitian()
rc Oct 12, 2018
2146c81
Merge pull request #1 from scikit-learn/master
lobpcg Oct 12, 2018
6c9aa2d
default verbosityLevel=0
lobpcg Oct 12, 2018
2a50f38
cumulative first round of editing requested by @rth
lobpcg Oct 14, 2018
e4dff55
trailing whitespaces removed
lobpcg Oct 14, 2018
d156df4
DOC: fix typo
rc Oct 14, 2018
a2e1578
unrelated comment removed
lobpcg Oct 19, 2018
3518e94
Merge branch 'lobpcg_svd' of https://github.com/lobpcg/scikit-learn i…
lobpcg Oct 19, 2018
d94bb93
extmath lobpcg_svd changes:
lobpcg Oct 21, 2018
09e0443
sklearn\utils\extmath.py lobpcg_svd
lobpcg Oct 22, 2018
6d3eda4
trailing whitespace removed
lobpcg Oct 22, 2018
f21849e
PEP8 formatting
lobpcg Oct 22, 2018
6826f0b
comment edit
lobpcg Oct 22, 2018
2cf4322
PEP8 line edit
lobpcg Oct 22, 2018
8cb8101
trying to fix LGTM error Keyword argument 'matvec' is not a supported…
lobpcg Oct 23, 2018
f96faaa
trailing whitespace removed
lobpcg Oct 23, 2018
2fe5eb3
default changed to explicitNormalMatrix=True
lobpcg Nov 1, 2018
47c1406
indentation (comment) fix?
lobpcg Nov 1, 2018
e8f6b48
ENH: support properly B=None branch in lobpcg() - do not compute unne…
rc Nov 4, 2018
2aeb959
STY: remove unused import
rc Nov 5, 2018
39cbe8b
ENH: speed-up LinearOperator in lobpcg_svd() by providing matmat()
rc Nov 5, 2018
3081252
STY: fix PEP8 errors
rc Nov 5, 2018
923ea6f
explicitNormalMatrix=None auto-choice added, comments edited
lobpcg Nov 5, 2018
0fcc8cc
typos fixed
lobpcg Nov 5, 2018
783b60e
FIX: update LinearOperator construction in lobpcg_svd() for old SciPy…
rc Nov 5, 2018
8e88087
FIX: remove precision argument in _save() (does not exist in recent n…
rc Jan 6, 2019
1a8a552
Merge pull request #2 from scikit-learn/master
lobpcg Jan 26, 2019
35d09a7
Merge pull request #4 from lobpcg/master
lobpcg Jan 26, 2019
ba17069
Update lobpcg.py
lobpcg Jan 28, 2019
19e7c24
sync with scipy version
rc Jan 28, 2019
3d72f93
fix escape sequence problem
rc Jan 28, 2019
421fa8f
fix docstring line lengths
rc Jan 28, 2019
4b3f74b
Merge pull request #5 from scikit-learn/master
lobpcg Feb 19, 2019
c874474
Merge pull request #7 from lobpcg/master
lobpcg Feb 19, 2019
9426b4b
Merge pull request #8 from scikit-learn/master
lobpcg Mar 5, 2019
19cbe70
Merge pull request #10 from lobpcg/master
lobpcg Mar 5, 2019
abdb873
Merge pull request #11 from scikit-learn/master
lobpcg Apr 11, 2019
95fc261
Merge pull request #13 from lobpcg/master
lobpcg Apr 11, 2019
32b015f
lobpcg moved from utils to externals
lobpcg Apr 16, 2019
d4d5d32
_parse_version error fixes
lobpcg Apr 17, 2019
34b9598
comment on normal matrix clarified
lobpcg Apr 17, 2019
70cf7ca
consistent tol setup everywhere in lobpcg
lobpcg Apr 18, 2019
445d06d
combined arpack randomized and lobpcg tests into 1
lobpcg Apr 18, 2019
ed79fb8
scikit-learn/pull/12319#discussion_r276985457
lobpcg Apr 19, 2019
5e9beb1
import scipy.__version__ fix
lobpcg Apr 19, 2019
b75bd34
Merge pull request #14 from scikit-learn/master
lobpcg May 25, 2019
99a0a8b
Merge pull request #15 from lobpcg/master
lobpcg May 25, 2019
879c8a1
MAINT refactor randomized and lobpcg SVD
glemaitre Jun 20, 2019
f6dd5cd
iter
glemaitre Jun 20, 2019
8d86223
xxx
glemaitre Jun 21, 2019
ed8473e
xxx
glemaitre Jun 21, 2019
7956cec
fix
glemaitre Jun 21, 2019
450ff0e
Merge remote-tracking branch 'origin/master' into pr/lobpcg/12319
glemaitre Jun 26, 2019
79aea6b
TST add lobpcg in the test for PCA
glemaitre Jun 26, 2019
7ce7a3e
TST add lobpcg to TruncatedSVD
glemaitre Jun 26, 2019
031fd67
FIX: vendor lobpcg for scipy < 1.3
glemaitre Jun 26, 2019
b1cacfe
PEP8
glemaitre Jun 26, 2019
c0565fa
update lobpcg
glemaitre Jun 26, 2019
28e3fe3
DOC add whats new
glemaitre Jun 26, 2019
a6ddc1f
fixes
glemaitre Jun 26, 2019
7312469
update really with the master version
glemaitre Jun 26, 2019
467f250
Merge branch 'is/upstream_lobpcg' into pr/lobpcg/12319
glemaitre Jun 26, 2019
d3749ed
DOC add whats new
glemaitre Jun 26, 2019
2e7e510
PEP8
glemaitre Jun 26, 2019
169c462
port the bmat utils from scipy
glemaitre Jun 26, 2019
711882b
port the bmat utils from scipy
glemaitre Jun 26, 2019
e8081e4
iter
glemaitre Jun 26, 2019
1c26637
Merge branch 'is/upstream_lobpcg' into pr/lobpcg/12319
glemaitre Jun 26, 2019
70263d3
iter
glemaitre Jun 26, 2019
60536f5
Merge remote-tracking branch 'origin/master' into pr/lobpcg/12319
glemaitre Jun 26, 2019
39327a6
DOC: improve documentation PCA
glemaitre Jun 26, 2019
95a81c7
iter
glemaitre Jun 27, 2019
26b3cdb
remove debug
glemaitre Jun 27, 2019
bbedba3
PEP8
glemaitre Jun 27, 2019
3cc6da1
additional assert
glemaitre Jun 27, 2019
2cc4798
iter on TruncatedSVD
glemaitre Jun 27, 2019
1c9cff8
remove useless code for supported scipy version
glemaitre Jun 27, 2019
cde9184
TST validation preconditioner in randomized_svd
glemaitre Jun 27, 2019
6ef8dad
TST refactor some test
glemaitre Jun 27, 2019
30e4497
Merge remote-tracking branch 'origin/master' into pr/lobpcg/12319
glemaitre Jul 16, 2019
b2709b9
do not support integer with lobpcg
glemaitre Jul 16, 2019
d133ae9
PEP8
glemaitre Jul 16, 2019
6d616ee
Merge branch 'master' into lobpcg_svd
lobpcg Aug 30, 2019
aec250f
Update sklearn/utils/extmath.py
lobpcg Aug 30, 2019
c20d809
Merge branch 'master' into lobpcg_svd
lobpcg Sep 10, 2019
bbf902a
Merge branch 'master' into lobpcg_svd
lobpcg Sep 26, 2019
6a03d77
Merge remote-tracking branch 'origin/master' into pr/lobpcg/12319
glemaitre Oct 24, 2019
fb1dbb7
Merge remote-tracking branch 'lobpcg/lobpcg_svd' into pr/lobpcg/12319
glemaitre Oct 24, 2019
4da2e77
Merge remote-tracking branch 'upstream/master' into pr/12319
thomasjpfan Nov 29, 2019
bd1755e
remove duplicated References section
lobpcg Nov 30, 2019
1ef74b1
Notes fixed
lobpcg Dec 1, 2019
c3b612b
fixing the docstring format
lobpcg Dec 1, 2019
6 changes: 6 additions & 0 deletions doc/whats_new/v0.22.rst
@@ -270,6 +270,12 @@ Changelog
underlying :class:`linear_model.LassoLars` when `algorithm='lasso_lars'`.
:issue:`12650` by `Adrin Jalali`_.

- |Enhancement| :class:`decomposition.PCA` and
:class:`decomposition.TruncatedSVD` have a new solver ``'lobpcg'``, which
can accelerate computation and improve accuracy.
Review comment (Member): Add ``randomized_svd`` here.

:pr:`12319` by :user:`Andrew Knyazev <lobpcg>` and
:user:`Guillaume Lemaitre <glemaitre>`.

:mod:`sklearn.dummy`
....................

3 changes: 2 additions & 1 deletion sklearn/cluster/_spectral.py
@@ -349,7 +349,8 @@ class SpectralClustering(ClusterMixin, BaseEstimator):

eigen_tol : float, optional, default: 0.0
Stopping criterion for eigendecomposition of the Laplacian matrix
when ``eigen_solver='arpack'``.
when ``eigen_solver='arpack'`` or ``eigen_solver='lobpcg'``. With
'lobpcg', ``tol=0.0`` is ignored and replaced by LOBPCG's internal default.

assign_labels : {'kmeans', 'discretize'}, default: 'kmeans'
The strategy to use to assign labels in the embedding
147 changes: 95 additions & 52 deletions sklearn/decomposition/_pca.py
@@ -7,7 +7,7 @@
# Denis A. Engemann <denis-alexander.engemann@inria.fr>
# Michael Eickenberg <michael.eickenberg@inria.fr>
# Giorgio Patrini <giorgio.patrini@anu.edu.au>
#
# Andrew Knyazev <andrew.knyazev@ucdenver.edu>
# License: BSD 3 clause

from math import log, sqrt
@@ -108,12 +108,14 @@ class PCA(_BasePCA):
data to project it to a lower dimensional space. The input data is centered
but not scaled for each feature before applying the SVD.

It uses the LAPACK implementation of the full SVD or a randomized truncated
SVD by the method of Halko et al. 2009, depending on the shape of the input
data and the number of components to extract.
By default, it uses the LAPACK implementation of the full SVD or a
randomized truncated SVD [3], depending on the shape of the input data and
the number of components to extract.

It can also use the scipy.sparse.linalg ARPACK implementation of the
truncated SVD.
One can also use the ARPACK implementation of the truncated SVD
(refer to :func:`scipy.sparse.linalg.svds`) or the randomized truncated
SVD with additional LOBPCG [5] preconditioning (refer to
:func:`scipy.sparse.linalg.lobpcg`).

Notice that this class does not support sparse input. See
:class:`TruncatedSVD` for an alternative with sparse data.
@@ -122,7 +124,7 @@

Parameters
----------
n_components : int, float, None or str
n_components : int, float, None or string, default=None
Number of components to keep.
if n_components is not set all components are kept::

@@ -148,7 +150,7 @@
fit(X).transform(X) will not yield the expected results,
use fit_transform(X) instead.

whiten : bool, optional (default False)
whiten : bool, default=False
When True (False by default) the `components_` vectors are multiplied
by the square root of n_samples and then divided by the singular values
to ensure uncorrelated outputs with unit component-wise variances.
@@ -158,42 +160,53 @@
improve the predictive accuracy of the downstream estimators by
making their data respect some hard-wired assumptions.

svd_solver : str {'auto', 'full', 'arpack', 'randomized'}
If auto :
The solver is selected by a default policy based on `X.shape` and
svd_solver : string {'auto', 'full', 'arpack', 'randomized', 'lobpcg'}
auto (default) :
the solver is selected by a default policy based on `X.shape` and
`n_components`: if the input data is larger than 500x500 and the
number of components to extract is lower than 80% of the smallest
dimension of the data, then the more efficient 'randomized'
method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.
If full :
run exact full SVD calling the standard LAPACK solver via
`scipy.linalg.svd` and select the components by postprocessing
:func:`scipy.linalg.svd` and select the components by
postprocessing.
If arpack :
run SVD truncated to n_components calling ARPACK solver via
`scipy.sparse.linalg.svds`. It requires strictly
0 < n_components < min(X.shape)
:func:`scipy.sparse.linalg.svds`. It requires strictly
0 < n_components < min(X.shape).
If randomized :
run randomized SVD by the method of Halko et al.
run randomized SVD as in [2].
If lobpcg :
run Locally Optimal Block Preconditioned Conjugate Gradient [5]
on the normal matrix X'*X or X*X', whichever of the two is smaller.
See :func:`scipy.sparse.linalg.lobpcg`.

.. versionadded:: 0.18.0

tol : float >= 0, optional (default .0)
Tolerance for singular values computed by svd_solver == 'arpack'.
tol : None or float, default=None
Tolerance for singular values computed by the 'randomized' and 'lobpcg'
SVD solvers. If None, then:
* `tol = 2 * eps` for svd_solver = 'randomized'.
Refer to :func:`scipy.sparse.linalg.svds`.
* `tol = n * sqrt(eps)`, where `n = min(n_samples, n_features)`,
for svd_solver = 'lobpcg'. Refer to :func:`scipy.sparse.linalg.lobpcg`.

.. versionadded:: 0.18.0
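The default tolerances above work out as follows (a sketch assuming float64 data; the `2 * eps` and `n * sqrt(eps)` formulas are taken directly from the docstring text):

```python
import numpy as np

n_samples, n_features = 1000, 50
eps = np.finfo(np.float64).eps        # machine epsilon, ~2.22e-16

tol_randomized = 2 * eps              # ~4.4e-16
n = min(n_samples, n_features)
tol_lobpcg = n * np.sqrt(eps)         # 50 * ~1.49e-8, ~7.5e-7
```

The 'lobpcg' default is much looser than the 'randomized' one, reflecting that LOBPCG's tol bounds an iterative residual norm rather than a backward error of a direct factorization.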

iterated_power : int >= 0, or 'auto', (default 'auto')
Number of iterations for the power method computed by
svd_solver == 'randomized'.
iterated_power : int or 'auto', default='auto'
Number of iterations for svd_solver = 'lobpcg', or of the power method
for svd_solver = 'randomized'.

.. versionadded:: 0.18.0

random_state : int, RandomState instance or None, optional (default None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`. Used when ``svd_solver`` == 'arpack' or 'randomized'.
by `np.random` in ``svd_solver`` == 'arpack', 'randomized',
or 'lobpcg'.

.. versionadded:: 0.18.0

@@ -244,15 +257,24 @@ class PCA(_BasePCA):
Number of samples in the training data.

noise_variance_ : float
The estimated noise covariance following the Probabilistic PCA model
from Tipping and Bishop 1999. See "Pattern Recognition and
Machine Learning" by C. Bishop, 12.2.1 p. 574 or
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
The estimated noise covariance following the Probabilistic PCA model
of [2]. See Sect. 12.2.1, p. 574 of [6], or [2]. It is required to
compute the estimated data covariance and score samples.

Equal to the average of (min(n_features, n_samples) - n_components)
smallest eigenvalues of the covariance matrix of X.

Notes
-----
For n_components == 'mle', this class uses the method of [1].
`score` and `score_samples` implement the probabilistic
PCA model from [2].
For the other solvers:
* svd_solver == 'arpack' uses :func:`scipy.sparse.linalg.svds`.
* svd_solver == 'randomized' uses the implementation of [3] and [4].
* svd_solver == 'lobpcg' uses the implementation of [5].

See Also
--------
KernelPCA : Kernel Principal Component Analysis.
@@ -262,26 +284,33 @@

References
----------
For n_components == 'mle', this class uses the method of *Minka, T. P.
"Automatic choice of dimensionality for PCA". In NIPS, pp. 598-604*

Implements the probabilistic PCA model from:
Tipping, M. E., and Bishop, C. M. (1999). "Probabilistic principal
component analysis". Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(3), 611-622.
via the score and score_samples methods.
See http://www.miketipping.com/papers/met-mppca.pdf

For svd_solver == 'arpack', refer to `scipy.sparse.linalg.svds`.

For svd_solver == 'randomized', see:
*Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
"Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions".
SIAM review, 53(2), 217-288.* and also
*Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011).
"A randomized algorithm for the decomposition of matrices".
Applied and Computational Harmonic Analysis, 30(1), 47-68.*
.. [1] Minka, Thomas P. "Automatic choice of dimensionality for PCA."
In Advances in neural information processing systems,
pp. 598-604. 2001.

.. [2] Tipping, Michael E., and Christopher M. Bishop.
"Probabilistic principal component analysis."
Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 61, no. 3 (1999): 611-622.
http://www.miketipping.com/papers/met-mppca.pdf

.. [3] Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp.
"Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions."
SIAM review 53, no. 2 (2011): 217-288.

.. [4] Martinsson, Per-Gunnar, Vladimir Rokhlin, and Mark Tygert.
"A randomized algorithm for the decomposition of matrices."
Applied and Computational Harmonic Analysis 30, no. 1 (2011): 47-68.

.. [5] Knyazev, Andrew V.
"Toward the optimal preconditioned eigensolver:
Locally optimal block preconditioned conjugate gradient method."
SIAM journal on scientific computing 23, no. 2 (2001): 517-541.
https://doi.org/10.1137/S1064827500366124

.. [6] Bishop, Christopher M.
"Pattern recognition and machine learning." Springer, 2006.

Examples
--------
@@ -366,7 +395,7 @@ def fit_transform(self, X, y=None):
This method returns a Fortran-ordered array. To convert it to a
C-ordered array, use 'np.ascontiguousarray'.
"""
U, S, V = self._fit(X)
U, S, _ = self._fit(X)
U = U[:, :self.n_components_]

if self.whiten:
@@ -407,15 +436,21 @@ def _fit(self, X):
self._fit_svd_solver = 'full'
elif n_components >= 1 and n_components < .8 * min(X.shape):
self._fit_svd_solver = 'randomized'
# need to add 'lobpcg' here
# This is also the case of n_components in (0,1)
else:
self._fit_svd_solver = 'full'

# Set the default tolerance if necessary
tol = (0. if self.tol is None and self._fit_svd_solver == 'arpack'
else self.tol)

# Call different fits for either full or truncated SVD
if self._fit_svd_solver == 'full':
return self._fit_full(X, n_components)
elif self._fit_svd_solver in ['arpack', 'randomized']:
return self._fit_truncated(X, n_components, self._fit_svd_solver)
elif self._fit_svd_solver in ['arpack', 'randomized', 'lobpcg']:
return self._fit_truncated(X, n_components, self._fit_svd_solver,
tol)
else:
raise ValueError("Unrecognized svd_solver='{0}'"
"".format(self._fit_svd_solver))
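The 'auto' branch of `_fit` above can be summarised as a standalone function (a sketch mirroring the dispatch logic in the diff; `choose_svd_solver` is a hypothetical name, and the diff's own TODO comment notes that 'lobpcg' is not yet part of the automatic choice):

```python
def choose_svd_solver(shape, n_components):
    # Mirrors PCA's 'auto' dispatch: exact LAPACK SVD for small data or
    # n_components == 'mle'; randomized SVD when few components are
    # requested from large data; exact SVD otherwise (including a float
    # n_components in (0, 1), which selects by explained variance).
    if max(shape) <= 500 or n_components == 'mle':
        return 'full'
    if 1 <= n_components < 0.8 * min(shape):
        return 'randomized'
    return 'full'
```

For example, a 1000 x 100 matrix with 10 components selects 'randomized', while the same matrix with 90 components (90 >= 0.8 * 100) falls back to 'full'.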
@@ -483,9 +518,9 @@ def _fit_full(self, X, n_components):

return U, S, V

def _fit_truncated(self, X, n_components, svd_solver):
"""Fit the model by computing truncated SVD (by ARPACK or randomized)
on X
def _fit_truncated(self, X, n_components, svd_solver, tol):
"""Fit the model by computing truncated SVD
(by ARPACK, randomized, or LOBPCG) on X
"""
n_samples, n_features = X.shape

@@ -520,7 +555,7 @@ def _fit_truncated(self, X, n_components, svd_solver):
if svd_solver == 'arpack':
# random init solution, as ARPACK does it internally
v0 = random_state.uniform(-1, 1, size=min(X.shape))
U, S, V = svds(X, k=n_components, tol=self.tol, v0=v0)
U, S, V = svds(X, k=n_components, tol=tol, v0=v0)
# svds doesn't abide by scipy.linalg.svd/randomized_svd
# conventions, so reverse its outputs.
S = S[::-1]
@@ -534,6 +569,14 @@ def _fit_truncated(self, X, n_components, svd_solver):
flip_sign=True,
random_state=random_state)

elif svd_solver == 'lobpcg':
# sign flipping is done inside
U, S, V = randomized_svd(
X, n_components=n_components, n_iter=self.iterated_power,
flip_sign=True, random_state=random_state,
preconditioner='lobpcg', tol=tol
)

self.n_samples_, self.n_features_ = n_samples, n_features
self.components_ = V
self.n_components_ = n_components