ENH: Faster Eigen Decomposition For Isomap & KernelPCA #31247
Conversation
…ith tests and integration into Isomap and KernelPCA
…into feature/randomized_eigsh_value
Thanks for the PR @yaichm. Please follow the instructions in the automated comment above to resolve the failing linter continuous integration checks. If you need help (e.g. if the instructions are not clear enough), please let us know with a specific description of your attempt at resolving the problems and the issues you faced.
Looking forward to it. Please feel free to ping me once done.
I've followed the automated instructions and fixed the linter issues. The benchmark result graphs have also been added. @ogrisel, feel free to review whenever you are available.
Thanks to the team for finalizing these results! So to summarize:
As requested by @smarie, here is the image showing the reconstruction error for 50 components.
@ogrisel I think the team is now ready for a first review (see summary #31247 (comment)) |
```diff
@@ -198,10 +198,6 @@ def test_randomized_eigsh(dtype):
     # eigenvectors
     assert eigvecs.shape == (4, 2)

     # with 'value' selection method, the negative eigenvalue does not show up
```
Please replace this section with:

```python
eigvals, eigvecs = _randomized_eigsh(X, n_components=2, selection="value")
# eigenvalues
assert eigvals.shape == (2,)
assert_array_almost_equal(eigvals, [3.0, 1.0])  # signed ordering: positive eig
# eigenvectors
assert eigvecs.shape == (4, 2)
```
```python
# make a random PSD matrix
X = make_sparse_spd_matrix(n_features, random_state=0)
```
This is PSD, but we should also check the non-PSD case. So let's create a symmetric random matrix instead. If you wish, you can test both cases:
- add a `@pytest.mark.parametrize("is_psd", (True, False))` decorator;
- in the test, switch on `if is_psd:` to create the random `X` (see the sketch after this list).
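A minimal sketch of what this parametrization could look like; `_randomized_eigsh` is the PR's function, but the test name and body below are illustrative, not the PR's actual test:

```python
import numpy as np
import pytest
from sklearn.datasets import make_sparse_spd_matrix


@pytest.mark.parametrize("is_psd", (True, False))
def test_randomized_eigsh_matrix_types(is_psd):
    n_features = 10
    rng = np.random.RandomState(0)
    if is_psd:
        # PSD case: all eigenvalues are non-negative
        X = make_sparse_spd_matrix(n_features, random_state=0)
    else:
        # generic symmetric case: negative eigenvalues may be present
        A = rng.randn(n_features, n_features)
        X = (A + A.T) / 2
    assert np.allclose(X, X.T)
    # reference eigenvalues to compare the randomized solver against
    ref_vals = np.linalg.eigvalsh(X)
    if not is_psd:
        assert ref_vals.min() < 0
```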
```
with bounded error. Unlike the 'module' strategy, it works efficiently with
non-positive semidefinite matrices, handling both positive and negative
eigenvalues directly.
```
I would rather suggest a simpler comment here:

```
with bounded error. Unlike the 'module' strategy, it returns the top `k`
eigenvalues by decreasing value: all the positive ones first, then the
negative ones if any.
```
And to add a more detailed comment in the Strategy 'module' description:

```
Strategy 'module':
(...existing definition...)
Unlike the 'value' strategy, this returns the top `k` eigenvalues by
decreasing **module**. Therefore, when `M` is non-positive semidefinite,
large negative eigenvalues will be returned before small positive ones.
When `M` is PSD, both strategies lead to the same results, as all
eigenvalues are positive.
```
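To make the difference concrete, here is a small illustration (plain NumPy, not the PR's code) of how the two selection strategies order the eigenvalues {5, 1, -4} for k=2:

```python
import numpy as np

eigvals = np.array([5.0, 1.0, -4.0])
k = 2

# 'value' strategy: sort by signed value, positives first
by_value = np.sort(eigvals)[::-1][:k]                   # -> [5., 1.]

# 'module' strategy: sort by absolute value (magnitude)
by_module = eigvals[np.argsort(-np.abs(eigvals))][:k]   # -> [5., -4.]
```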
```python
# -- eigenvectors comparison
assert eigvecs_lapack.shape == (n_features, k)
dummy_vecs = np.zeros_like(eigvecs).T
eigvecs, _ = svd_flip(eigvecs, dummy_vecs)
```
Suggested change:

```python
# Fix the signs so that both have the same convention
eigvecs, _ = svd_flip(eigvecs, dummy_vecs)
```
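For context, a short sketch of why this sign fixing is needed: eigenvectors are only defined up to sign, so two solvers can legitimately return `v` and `-v` for the same eigenpair, and `svd_flip` applies a deterministic sign convention before comparison. The values below are illustrative:

```python
import numpy as np
from sklearn.utils.extmath import svd_flip

v = np.array([[1.0], [2.0]])            # a single eigenvector as a column
u1, _ = svd_flip(v, np.zeros((1, 1)))
u2, _ = svd_flip(-v, np.zeros((1, 1)))  # same eigenvector, opposite sign
np.testing.assert_allclose(u1, u2)      # identical after sign fixing
```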
```diff
@@ -169,7 +171,7 @@ class Isomap(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
     "n_neighbors": [Interval(Integral, 1, None, closed="left"), None],
     "radius": [Interval(Real, 0, None, closed="both"), None],
     "n_components": [Interval(Integral, 1, None, closed="left")],
-    "eigen_solver": [StrOptions({"auto", "arpack", "dense"})],
+    "eigen_solver": [StrOptions({"auto", "arpack", "dense", "randomized_value"})],
```
"randomized"
could probably be more straightforward as a parameter name.
```diff
@@ -57,6 +57,8 @@ class Isomap(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
     'dense' : Use a direct solver (i.e. LAPACK)
         for the eigenvalue decomposition.

+    'randomized_value' : Use randomized solver in order to reduce complexity.
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"randomized" could probably be more straightforward as a parameter name.
```diff
@@ -263,7 +263,9 @@ class KernelPCA(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
     "kernel_params": [dict, None],
     "alpha": [Interval(Real, 0, None, closed="left")],
     "fit_inverse_transform": ["boolean"],
-    "eigen_solver": [StrOptions({"auto", "dense", "arpack", "randomized"})],
+    "eigen_solver": [
+        StrOptions({"auto", "dense", "arpack", "randomized", "randomized_value"})
+    ],
```
If we really add `randomized_value` to the KernelPCA solvers, it should be declared in the docstring of the parameters (as for Isomap). If we do this, it should be clear that we expect the results to be equivalent in speed and quality to the current "randomized" (SVD-based) method, except when a precomputed kernel matrix is provided that is actually not PSD. In that case the current "randomized" solver may not lead to correct results, or may even raise an error (if a negative eigenvalue is returned by the solver).
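A small numeric illustration of the failure mode being described (plain NumPy, not KernelPCA itself): for a non-PSD symmetric "kernel", the largest-magnitude eigenvalue can be negative, so a magnitude-based (SVD-style) selection can pick it up while a value-based selection keeps the positive eigenvalues first.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 6)
K = (A + A.T) / 2                  # symmetric but generally not PSD
w = np.linalg.eigvalsh(K)
print(np.sort(w)[::-1])            # some eigenvalues are negative
print(np.sort(np.abs(w))[::-1])    # the magnitude ordering can differ
```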
```diff
@@ -363,6 +365,14 @@ def _fit_transform_in_place(self, K):
             random_state=self.random_state,
             selection="module",
         )
+        elif eigen_solver == "randomized_value":
+            self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
```
Maybe we could add comments to this branch and the previous branch of the `if`, so that it is easier to remember the difference while maintaining the code. For the previous branch:

```python
# Use selection='module' for SVD decomposition (wrong if non-PSD precomputed kernel)
```

and for this one:

```python
# Use selection='value' for direct eigenvalue decomposition (robust to non-PSD)
self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
```
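Putting the two suggestions together, the branches might read as follows. This is a sketch assembled from the diff fragments above, not the PR's final code: the surrounding method is elided (so the fragment is not standalone-runnable), and the `randomized_value` argument list is assumed to mirror the `randomized` one.

```python
if eigen_solver == "randomized":
    # Use selection='module' for SVD decomposition
    # (wrong if a non-PSD precomputed kernel is provided)
    self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
        K,
        n_components=n_components,
        random_state=self.random_state,
        selection="module",
    )
elif eigen_solver == "randomized_value":
    # Use selection='value' for direct eigenvalue decomposition
    # (robust to non-PSD matrices)
    self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
        K,
        n_components=n_components,
        random_state=self.random_state,
        selection="value",
    )
```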
```python
    random_state=None,
):
    """
    Approximate eigenvalue decomposition of a Hermitian matrix A ≈ U Λ U*.
```
Question to core devs: is the symbol ≈ allowed in docstrings?

> Approximate eigenvalue decomposition of a Hermitian matrix A ≈ U Λ U*.
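As a side note, the docstring's statement can be checked numerically; here is a short sketch using `numpy.linalg.eigh` as a stand-in for the PR's randomized solver, on a real symmetric matrix (where U* is simply the transpose):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(5, 5)
A = (A + A.T) / 2              # real symmetric (Hermitian) matrix
lam, U = np.linalg.eigh(A)     # exact decomposition A = U diag(lam) U^T
np.testing.assert_allclose(A, U @ np.diag(lam) @ U.T, atol=1e-10)
```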
```
**Halko, Martinsson, and Tropp (2011)**: *Finding Structure with Randomness:
Probabilistic Algorithms for Constructing Approximate Matrix Decompositions*.
This method provides a faster alternative to existing eigen decomposition techniques.

- Integrated :func:`eigen_decomposition_one_pass` into :class:`sklearn.manifold.Isomap` and
```
Suggested change:

```diff
-- Integrated :func:`eigen_decomposition_one_pass` into :class:`sklearn.manifold.Isomap` and
+- Integrated :func:`randomized_eigen_decomposition` into :class:`sklearn.manifold.Isomap` and
```
```
- Integrated :func:`eigen_decomposition_one_pass` into :class:`sklearn.manifold.Isomap` and
  :class:`sklearn.decomposition.KernelPCA` as an additional option for eigen decomposition.

- Added a test suite comparing the new method to existing solvers (:obj:`arpack`, :obj:`dense`, etc.), ensuring numerical
```
Suggested change:

```diff
-- Added a test suite comparing the new method to existing solvers (:obj:`arpack`, :obj:`dense`, etc.), ensuring numerical
+- Added a test suite comparing the new method to existing solvers (``'arpack'``, ``'dense'``, etc.), ensuring numerical
```
```
by :user: `Sylvain Marié<@smarie>`, `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
`Hamza Zeroual<@HamzaLuffy>` and `Amine Hannoun<@AmineHannoun>`.
```
You guys did most of the work and deserve to appear first :)

Suggested change:

```diff
-by :user: `Sylvain Marié<@smarie>`, `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
-`Hamza Zeroual<@HamzaLuffy>` and `Amine Hannoun<@AmineHannoun>`.
+by `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
+`Hamza Zeroual<@HamzaLuffy>`, `Amine Hannoun<@AmineHannoun>` and :user: `Sylvain Marié<@smarie>`.
```
```diff
@@ -0,0 +1,17 @@
+Feature
+------------
```
```
What you can observe:
---------------------
- The `auto` solver provides a reference solution.
```
The 'auto' behaviour should rather be improved to rely on some heuristic in order to solve the problem faster automatically. See what has been done in #27491
Sorry for the lack of feedback, I plan to come back to review this PR soonish. In the meantime, could you please expand the right-hand side of this benchmark plot (#31247 (comment)) to go to an even larger number of data points and/or components? Also, could you please try to use log-scaled axes to ease extrapolation, assuming the gap between the two methods is not a fixed constant?
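A sketch of what such a log-scaled comparison could look like; the timing arrays below are placeholders (idealized power laws, not measured data), chosen only to show that straight lines on log-log axes make extrapolation easy:

```python
import matplotlib.pyplot as plt
import numpy as np

n_samples = np.array([1_000, 2_000, 5_000, 10_000, 20_000])
t_dense = 1e-9 * n_samples**3.0   # placeholder: cubic-ish dense solver cost
t_rand = 2e-7 * n_samples**2.0    # placeholder: quadratic-ish randomized cost

plt.loglog(n_samples, t_dense, "o-", label="dense")
plt.loglog(n_samples, t_rand, "s-", label="randomized_value")
plt.xlabel("n_samples")
plt.ylabel("fit time (s)")
plt.legend()
plt.show()
```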
Fixes #31246

Implemented `_randomized_eigsh(selection='value')` and integrated it into KernelPCA and Isomap:

- Introduced a new eigenvalue decomposition option, `_randomized_eigsh(selection='value')`, for faster computation.
- Integrated this solver into both KernelPCA and Isomap as an alternative to the dense solvers.
- Added comprehensive tests in `extmath.py` to validate the decomposition accuracy.
- Benchmarked against existing solvers, comparing execution time and reconstruction error.

The benchmark result graphs comparing execution time and reconstruction error with existing solvers will be added in the comment below.
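If the PR's proposed parameter value is kept as-is, usage would look roughly like the sketch below. Note that `eigen_solver="randomized_value"` is the option added by this PR and is not available in released scikit-learn, so this only runs on the PR's branch:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
# 'randomized_value' is the new eigen solver proposed in this PR
embedding = Isomap(n_components=2, eigen_solver="randomized_value").fit_transform(X)
print(embedding.shape)  # (1000, 2)
```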