-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' #10359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,7 +31,7 @@ | |
from ..utils.validation import check_is_fitted | ||
|
||
|
||
def _assess_dimension_(spectrum, rank, n_samples, n_features): | ||
def _assess_dimension_(spectrum, rank, n_samples, n_features, rcond=1e-15): | ||
"""Compute the likelihood of a rank ``rank`` dataset | ||
|
||
The dataset is assumed to be embedded in gaussian noise of shape(n, | ||
|
@@ -47,6 +47,9 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features): | |
Number of samples. | ||
n_features : int | ||
Number of features. | ||
rcond : float | ||
Cut-off for values in `spectrum`. Any value lower than this | ||
will be ignored (`default=1e-15`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should the default be a function of dtype? |
||
|
||
Returns | ||
------- | ||
|
@@ -75,6 +78,8 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features): | |
v = 1 | ||
else: | ||
v = np.sum(spectrum[rank:]) / (n_features - rank) | ||
if rcond > v: | ||
return -np.inf | ||
pv = -np.log(v) * n_samples * (n_features - rank) / 2. | ||
|
||
m = n_features * rank - rank * (rank + 1.) / 2. | ||
|
@@ -84,6 +89,8 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features): | |
spectrum_ = spectrum.copy() | ||
spectrum_[rank:n_features] = v | ||
for i in range(rank): | ||
if spectrum_[i] < rcond: | ||
break | ||
for j in range(i + 1, len(spectrum)): | ||
pa += log((spectrum[i] - spectrum[j]) * | ||
(1. / spectrum_[j] - 1. / spectrum_[i])) + log(n_samples) | ||
|
@@ -448,7 +455,8 @@ def _fit_full(self, X, n_components): | |
# Postprocess the number of components required | ||
if n_components == 'mle': | ||
n_components = \ | ||
_infer_dimension_(explained_variance_, n_samples, n_features) | ||
_infer_dimension_(explained_variance_, n_samples, | ||
n_features) + 1 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you cannot change the behavior in any case like this. Was is it a bug? when I complained before about n_components=0 I was surprised but was the behavior expected for MLE option? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure. But this off by one problem is discussed here. #4827 (comment) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think there was an off-by-one error here formerly, the argmax seems to correspond correctly to the assessed rank. But if rank=0 is a bad answer, then we don't need to assess rank=0... If so, that's a separate bug, but this +1 should be removed, IMO. |
||
elif 0 < n_components < 1.0: | ||
# number of components for which the cumulated explained | ||
# variance percentage is superior to the desired threshold | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the name
rcond
?