
[WIP] PCA n_components='mle' instability. Issue 4441 #4827


Closed
wants to merge 9 commits

Conversation

lbillingham

I'm only getting single lines in the compare view now for

https://github.com/scikit-learn/scikit-learn/compare/master...lbillingham:issue_4441 

hopefully line endings are now actually posix-y.
Closes #4441.

unknown and others added 4 commits June 4, 2015 11:30
…m with a tail -> 0.0 when using `pca(n_components='mle')`
…values and spectrum with a tail -> 0.0 when using in PCA('mle')

I've run into log(0) errors on data from the wild.
I haven't been able to construct a synthetic pathological X. Since _assess_dimension_ and _infer_dimension_ are imported in the tests, I just test using a pathological spectrum.
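For illustration, here is a minimal sketch (values are made up, not from the PR) of the kind of pathological spectrum being described: eigenvalues whose tail is exactly zero make the tail-variance term `v` vanish, and `np.log(v)` then yields `-inf`.

```python
import numpy as np

# Hypothetical pathological spectrum: eigenvalues with an exactly-zero tail,
# the kind of input that breaks PCA(n_components='mle').
spectrum = np.array([2.5, 1.2, 0.5, 0.0, 0.0, 0.0])

rank = 3
n_features = len(spectrum)

# The tail variance as computed inside _assess_dimension_:
v = np.sum(spectrum[rank:]) / (n_features - rank)
print(v)  # 0.0

# np.log(0.0) is -inf, which then propagates through the log-likelihood.
with np.errstate(divide='ignore'):
    print(np.log(v))  # -inf
```

Passing such a spectrum straight to the helper sidesteps the need for a full pathological X.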
@amueller
Member

amueller commented Jun 5, 2015

looks much better. Btw, you can force-push to old branches to avoid opening new PRs. But you can also just close the other one.

@amueller
Member

amueller commented Jun 5, 2015

It would be great if, in addition, we had a test with a full dataset. If you have a perfectly correlated one, shouldn't that give you zeros? I think make_classification or make_low_rank_matrix should give you one of these.
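As a sketch of that suggestion (the parameter values here are illustrative, not from the PR), make_low_rank_matrix with a zero tail_strength produces data whose trailing singular values are numerically negligible:

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix

# Illustrative parameters: effective rank far below n_features, no noisy tail.
X = make_low_rank_matrix(n_samples=100, n_features=20,
                         effective_rank=2, tail_strength=0.0,
                         random_state=0)

# The trailing singular values sit at the numerical noise floor.
_, s, _ = np.linalg.svd(X, full_matrices=False)
ratio = s[-1] / s[0]
print(ratio < 1e-6)  # True: the spectrum tail is effectively zero
```

Whether the tail ends up *exactly* zero, rather than merely tiny, depends on floating-point round-off in the SVD.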

@lbillingham
Author

I'll take a look at better testing. Will probably not get to it for a couple of days, though.
Thanks for the info about pushing to update an open PR.

Sent from my iFern


@@ -23,7 +24,7 @@


def _assess_dimension_(spectrum, rank, n_samples, n_features):
"""Compute the likelihood of a rank ``rank`` dataset
"""Compute the likelihood of a rank ``rank`` dataset.
Member

Compute the likelihood that the dataset has the given rank?

…all values rather than rely on , fixed function name underscoring
 into issue_4441

Conflicts:
	sklearn/decomposition/pca.py
@lbillingham
Author

I've tried making a full test using

X, _ = sklearn.datasets.make_classification(n_repeated=17, n_informative=1, n_clusters_per_class=1, n_classes=1)

and

X = sklearn.datasets.make_low_rank_matrix(effective_rank=2, tail_strength=1.0)

which, I think, are about as likely to cause an error as I can make them.

We do run into v = 0 in v = np.sum(spectrum[rank:]) / (n_features - rank), but I have not been able to force a situation where we get a log(0): spectrum_[rank:n_features] = v and for i in range(rank) always conspire to avoid the 0.

If there is an off-by-one error and we should be looping to range(rank + 1) then the datasets above would cause log(0) errors.
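A small sketch of that off-by-one observation (the spectrum values are made up): with a zero tail, range(rank) only ever visits the strictly positive head of the spectrum, while range(rank + 1) would reach the first zero entry.

```python
import numpy as np

# Made-up spectrum with a zero tail.
spectrum = np.array([2.5, 1.2, 0.5, 0.0, 0.0])
rank = 3
n_features = len(spectrum)

v = np.sum(spectrum[rank:]) / (n_features - rank)  # 0.0 for this spectrum
spectrum_ = spectrum.copy()
spectrum_[rank:n_features] = v

# range(rank) touches only indices 0..rank-1, all strictly positive here,
# so a log over those entries never sees the zero tail.
head_ok = all(spectrum_[i] > 0 for i in range(rank))
print(head_ok)  # True

# range(rank + 1) would include index `rank`, which is 0.0 after the fill.
print(spectrum_[rank])  # 0.0
```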

I've changed the test spectrum to be as suggested and removed the trailing _ from the function names.

I also noticed that

pv = -np.log(v) * n_samples * (n_features - rank) / 2.

was susceptible to log(0), so I've moved the if rcond > abs(v): check before it.
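The guard being described can be sketched as follows (the function name and the rcond default are mine, not the PR's actual code): check the tolerance before taking the log, and return -inf for a numerically zero tail variance instead of evaluating log(0).

```python
import numpy as np

def tail_log_likelihood(v, n_samples, n_features, rank, rcond=1e-15):
    # Bail out before taking the log of a tail variance that is
    # numerically zero, instead of computing -np.log(0.0) * ... .
    if rcond > abs(v):
        return -np.inf
    return -np.log(v) * n_samples * (n_features - rank) / 2.

print(tail_log_likelihood(0.0, n_samples=100, n_features=20, rank=3))  # -inf
print(np.isfinite(tail_log_likelihood(0.5, n_samples=100, n_features=20, rank=3)))  # True
```

Returning -inf is consistent with the interpretation that a rank whose residual variance collapses to zero has vanishing likelihood under the MLE model.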

@lbillingham lbillingham closed this Jun 8, 2015
@lbillingham
Author

I didn't mean to close this, sorry.
Hit the wrong button.

@lbillingham lbillingham reopened this Jun 8, 2015
@lbillingham
Author

sorry, looks like I merged something wrong

@amueller amueller changed the title Issue 4441 [WIP] PCA n_components='mlp' instability. Issue 4441 Jun 8, 2015
@amueller
Member

amueller commented Jun 8, 2015

can you please add the test with a synthetic X and fixed random state?

@lbillingham
Author

Sure: I'll get to it in ~12 hours or so. I don't think I can find a way to exercise the protective code paths I created using the output of make_low_rank_matrix or make_classification.

@amueller
Member

amueller commented Jun 8, 2015

hum ok then. I'll try to have a look later, but maybe someone else finds time?

@@ -75,6 +80,8 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features):
spectrum_ = spectrum.copy()
spectrum_[rank:n_features] = v
for i in range(rank):
if spectrum_[i] < rcond:
Member

Can spectrum_ values ever be negative here? If not, then the check above doesn't need the abs either, right?

Author

Spectrum comes from S**2 with _, S, _ = scipy.linalg.svd(X), so I think it should be non-negative. [We'll never see complex numbers, will we?]

I'll have a go at constructing an X that gets trapped by the new code paths for a full test, but I have not been able to so far.
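A quick check of that non-negativity claim (random data; the shapes are arbitrary):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(30, 5)

# The spectrum handed to the MLE helper is built from squared singular
# values, so for real-valued X it is non-negative by construction.
_, S, _ = linalg.svd(X)
spectrum = S ** 2
print(bool((spectrum >= 0).all()))  # True
```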


Member

we'll never see complex numbers will we?

PCA could be one of the algorithms that might work with complex numbers. But we don't support that AFAIK.

We can either remove the abs everywhere, or add it also inside the v summation. Couldn't say which is better.

….inf`` codepath in `pca._assess_dimension`. Removed redundant `abs(v)` as `v` is composed by summing squared real numbers.
@lbillingham
Author

@vene you are right about the abs() being surplus to requirements.
I've got an X that goes through the rcond > v test, but have not been able to make one for spectrum_[i] < rcond.

@lbillingham
Author

Anything else I need to do on this?

@amueller
Member

Sorry, we have a lot of things to review and not enough people.

n_redundant=1, n_clusters_per_class=1,
random_state=20150609)
pca = PCA(n_components='mle').fit(X)
assert_equal(pca.n_components_, 0)
Member

That seems really odd to me. Shouldn't it be 1?

Author

Is this related to the off-by-one error that @alexis-mignon talked about in the original comment on 24 Mar?

@lbillingham
Author

No worries, so long as I'm not causing a hold up


@rth
Member

rth commented Jun 8, 2017

@lbillingham Could you please resolve conflicts or rebase? Thanks.

@lbillingham
Author

Update on this: I currently have only a work computer and a tablet, neither of which is up to this kind of work.
I've asked if I can do FOSS work on my own time but on company hardware; hopefully that will be a yes, but it is taking some time.

@rth
Member

rth commented Nov 30, 2017

@lbillingham If you don't have the availability to finish this PR, can it be continued by another contributor? Thanks.

@glemaitre glemaitre added this to the 0.21 milestone Jun 13, 2018
@jnothman jnothman changed the title [WIP] PCA n_components='mlp' instability. Issue 4441 [WIP] PCA n_components='mle' instability. Issue 4441 Apr 11, 2019
@jnothman jnothman removed this from the 0.21 milestone Apr 11, 2019
@yxliang01

yxliang01 commented Apr 15, 2019

Wow... This has been 4 years now... And, mle is still completely unusable to me... 😂

@jnothman
Member

It's yours to complete if you want, @yxliang01!

@jnothman
Member

#10359 is the one to be completed, @yxliang01. Go on, take it over, address the last few comments.

@jnothman
Member

I'm closing this one in preference for #10359.

@jnothman jnothman closed this Apr 15, 2019
@yxliang01

@jnothman Unfortunately, I don't really understand the MLE algorithm, nor sklearn itself. So, at least for now, I am not the right person to do so...

@jnothman
Member

Pity. It would be nice to see this completed, and I think it is not far off.

@yxliang01

@jnothman Yeah...


Successfully merging this pull request may close these issues.

Problems in sklearn.decomposition.PCA with "n_components='mle' option"
7 participants