Skip to content

[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' #10359

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

thechargedneutron
Copy link
Contributor

@thechargedneutron thechargedneutron commented Dec 22, 2017

Reference Issues/PRs

Fixes #4441

What does this implement/fix? Explain your changes.

This is a PR originally by @lbillingham in #4827. All thanks to him.

Any other comments?

@thechargedneutron thechargedneutron changed the title [WIP] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' [MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' Dec 22, 2017
n_redundant=1, n_clusters_per_class=1,
random_state=42)
pca = PCA(n_components='mle').fit(X)
assert_equal(pca.n_components_, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it surprising to get 0 components even if you have 1 informative feature?

Copy link
Contributor Author

@thechargedneutron thechargedneutron Dec 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agramfort Yes, indeed. There's an off by one problem as found in #4827 (comment) . But I am reluctant to change

return ll.argmax()
this line to ll.agrmax() + 1 as it may break existing code. Any workaround for this?

@thechargedneutron
Copy link
Contributor Author

@agramfort Does this look correct now?

@@ -448,7 +455,8 @@ def _fit_full(self, X, n_components):
# Postprocess the number of components required
if n_components == 'mle':
n_components = \
_infer_dimension_(explained_variance_, n_samples, n_features)
_infer_dimension_(explained_variance_, n_samples,
n_features) + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot change the behavior in any case like this. Was is it a bug? when I complained before about n_components=0 I was surprised but was the behavior expected for MLE option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. But this off by one problem is discussed here. #4827 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there was an off-by-one error here formerly, the argmax seems to correspond correctly to the assessed rank. But if rank=0 is a bad answer, then we don't need to assess rank=0... If so, that's a separate bug, but this +1 should be removed, IMO.

@agramfort
Copy link
Member

agramfort commented Dec 28, 2017 via email

@@ -448,7 +455,8 @@ def _fit_full(self, X, n_components):
# Postprocess the number of components required
if n_components == 'mle':
n_components = \
_infer_dimension_(explained_variance_, n_samples, n_features)
_infer_dimension_(explained_variance_, n_samples,
n_features) + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.

@@ -47,6 +47,9 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features):
Number of samples.
n_features : int
Number of features.
rcond : float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the name rcond?

@@ -47,6 +47,9 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features):
Number of samples.
n_features : int
Number of features.
rcond : float
Cut-off for values in `spectrum`. Any value lower than this
will be ignored (`default=1e-15`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the default be a function of dtype?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems in sklearn.decomposition.PCA with "n_components='mle' option"
4 participants