[WIP] PCA n_components='mle' instability. Issue 4441 #4827
Conversation
…m with a tail -> 0.0 when using `pca(n_components='mle')`

…values and spectrum with a tail -> 0.0 when using 'mle' in PCA

I've run into log(0) errors on data from the wild. I haven't been able to construct a synthetic pathological `X`. Since `_assess_dimension_` and `_infer_dimension_` are imported in the tests, I just test using a pathological spectrum.
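A minimal sketch of that kind of pathological-spectrum test (assuming the private import path of this era, `sklearn.decomposition.pca`, and the signatures shown in this PR's diff; not the PR's actual test code):

```python
import numpy as np
from sklearn.decomposition.pca import _assess_dimension_, _infer_dimension_

n_samples, n_features = 100, 5
# an eigenvalue spectrum whose tail decays to exactly zero
spectrum = np.array([2.0, 1.0, 1e-8, 0.0, 0.0])

# with rank = n_features - 1, the tail variance v becomes exactly zero,
# feeding log(0) in the unpatched code; a fixed version should yield -inf
# (or a finite value) rather than a NaN or an error
ll = _assess_dimension_(spectrum, n_features - 1, n_samples, n_features)
print(ll)

print(_infer_dimension_(spectrum, n_samples, n_features))
```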
Looks much better. BTW, you can force-push to old branches to avoid opening new PRs. But you can also just close the other one.

It would be great if, in addition, we had a test with a full dataset. If you have a perfectly correlated one, shouldn't that give you zeros? I think […]
I'll take a look at better testing. Will probably not get to it for a couple of days though. Sent from my iFern
`sklearn/decomposition/pca.py` (outdated)

```diff
@@ -23,7 +24,7 @@
 def _assess_dimension_(spectrum, rank, n_samples, n_features):
-    """Compute the likelihood of a rank ``rank`` dataset
+    """Compute the likelihood of a rank ``rank`` dataset.
```
Compute the likelihood that the dataset has the given rank?

…all values rather than rely on […], fixed function name underscoring
I've tried making a full test using `X, _ = sklearn.datasets.make_classification(n_repeated=17, n_informative=1, n_clusters_per_class=1, n_classes=1)` and `X = sklearn.datasets.make_low_rank_matrix(effective_rank=2, tail_strength=1.0)`, which, I think, are about as likely to cause an error as I can make them. We do run into […]. If there is an off-by-one error and we should be looping to […], I've changed the test […]. I also noticed that `pv = -np.log(v) * n_samples * (n_features - rank) / 2.` was susceptible to […].
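For reference, a runnable sketch of those two constructions (the `random_state` arguments are my addition, purely for reproducibility; they are not in the comment above):

```python
import sklearn.datasets
from sklearn.decomposition import PCA

# many repeated/redundant features -> a spectrum tail of (near-)zeros
X, _ = sklearn.datasets.make_classification(
    n_repeated=17, n_informative=1, n_clusters_per_class=1,
    n_classes=1, random_state=0)
PCA(n_components='mle').fit(X)

# a matrix with a low effective rank
X = sklearn.datasets.make_low_rank_matrix(
    effective_rank=2, tail_strength=1.0, random_state=0)
PCA(n_components='mle').fit(X)
```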
I didn't mean to close this, sorry.

Sorry, looks like I merged something wrong.

Can you please add the test with a synthetic `X` and a fixed random state?
Sure: I'll get to it in ~12 hours or so. I don't think I can find a way to exercise the protective code paths I created using the output of […].

Hum, OK then. I'll try to have a look later, but maybe someone else finds time?
```diff
@@ -75,6 +80,8 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features):
     spectrum_ = spectrum.copy()
     spectrum_[rank:n_features] = v
     for i in range(rank):
+        if spectrum_[i] < rcond:
```
Can `spectrum_` values ever be negative here? If not, then the check above doesn't need the `abs` either, right?
Spectrum comes from `S**2` with `_, S, _ = scipy.linalg.svd(X)`, so I think it should be non-negative. [We'll never see complex numbers, will we?] I'll have a go at constructing an `X` that gets trapped by the new code paths for a full test, but I've not been able to so far.
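A quick, purely illustrative sketch of that non-negativity claim:

```python
import numpy as np
import scipy.linalg

rng = np.random.RandomState(0)
X = rng.randn(30, 5)

_, S, _ = scipy.linalg.svd(X)
spectrum = S ** 2             # squared singular values = eigenvalues of X.T @ X
assert np.all(spectrum >= 0)  # squares of real numbers are never negative
```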
> We'll never see complex numbers, will we?

PCA could be one of the algorithms that might work with complex numbers, but we don't support it AFAIK. We can either remove the `abs` everywhere, or add it also inside the `v` summation. Couldn't say which is better.
…into issue_4441
….inf`` codepath in `pca._assess_dimension`. Removed the redundant `abs(v)`, as `v` is composed by summing squared real numbers.
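A small illustration of why `abs(v)` is redundant. The expression below mirrors the tail-variance computation in `_assess_dimension_`; the helper name is hypothetical:

```python
import numpy as np

def tail_variance(spectrum, rank, n_features):
    # mean of the trailing eigenvalues: each entry is a squared (real)
    # singular value, so the sum -- and hence v -- cannot be negative
    v = np.sum(spectrum[rank:]) / (n_features - rank)
    return v  # wrapping this in abs() is a no-op

print(tail_variance(np.array([4.0, 1.0, 0.25, 0.0]), 2, 4))  # 0.125
```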
@vene you are right about the […].

Anything else I need to do on this?

Sorry, we have a lot of things to review and not enough people.
```python
                               n_redundant=1, n_clusters_per_class=1,
                               random_state=20150609)
pca = PCA(n_components='mle').fit(X)
assert_equal(pca.n_components_, 0)
```
That seems really odd to me. Shouldn't it be 1?
Is this related to the off-by-one error that @alexis-mignon talked about in the original comment on 24 Mar?
No worries, so long as I'm not causing a hold-up.
@lbillingham Could you please resolve conflicts or rebase? Thanks.

Update on this: I currently have only a work computer and a tablet, neither of which is up to this kind of work.

@lbillingham If you don't have the availability to finish this PR, can it be continued by another contributor? Thanks.
Wow... This has been 4 years now... And, […]

It's yours to complete if you want, @yxliang01!

#10359 is the one to be completed, @yxliang01. Go on, take it over, address the last few comments.

I'm closing this one in favour of #10359.
@jnothman Unfortunately, I don't really understand the MLE algorithm, nor sklearn itself. So, at least for now, I'm not the right person to do so...

Pity. It would be nice to see this completed, and I think it is not far off.

@jnothman Yeah...
I'm only getting single lines in the compare view now for […]; hopefully line endings are now actually posix-y.
Closes #4441.