-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' #10359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' #10359
Conversation
n_redundant=1, n_clusters_per_class=1, | ||
random_state=42) | ||
pca = PCA(n_components='mle').fit(X) | ||
assert_equal(pca.n_components_, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it surprising to get 0 components even if you have 1 informative feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agramfort Yes, indeed. There's an off by one problem as found in #4827 (comment) . But I am reluctant to change
scikit-learn/sklearn/decomposition/pca.py
Line 105 in 5311c81
return ll.argmax() |
ll.agrmax() + 1
as it may break existing code. Any workaround for this?
…into pca_svd_solver
@agramfort Does this look correct now? |
@@ -448,7 +455,8 @@ def _fit_full(self, X, n_components): | |||
# Postprocess the number of components required | |||
if n_components == 'mle': | |||
n_components = \ | |||
_infer_dimension_(explained_variance_, n_samples, n_features) | |||
_infer_dimension_(explained_variance_, n_samples, | |||
n_features) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot change the behavior in any case like this. Was is it a bug? when I complained before about n_components=0 I was surprised but was the behavior expected for MLE option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure. But this off by one problem is discussed here. #4827 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there was an off-by-one error here formerly, the argmax seems to correspond correctly to the assessed rank. But if rank=0 is a bad answer, then we don't need to assess rank=0... If so, that's a separate bug, but this +1 should be removed, IMO.
if you look here:
http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
you see that MLE gives 10 components as expected.
|
@@ -448,7 +455,8 @@ def _fit_full(self, X, n_components): | |||
# Postprocess the number of components required | |||
if n_components == 'mle': | |||
n_components = \ | |||
_infer_dimension_(explained_variance_, n_samples, n_features) | |||
_infer_dimension_(explained_variance_, n_samples, | |||
n_features) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.
@@ -47,6 +47,9 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features): | |||
Number of samples. | |||
n_features : int | |||
Number of features. | |||
rcond : float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the name rcond
?
@@ -47,6 +47,9 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features): | |||
Number of samples. | |||
n_features : int | |||
Number of features. | |||
rcond : float | |||
Cut-off for values in `spectrum`. Any value lower than this | |||
will be ignored (`default=1e-15`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the default be a function of dtype?
Reference Issues/PRs
Fixes #4441
What does this implement/fix? Explain your changes.
This is a PR originally by @lbillingham in #4827. All thanks to him.
Any other comments?