
[WIP] PCA n_components='mle' instability. Issue 4441 #4827


Closed
wants to merge 9 commits

Conversation

lbillingham

I'm only getting single lines in the compare view now for

https://github.com/scikit-learn/scikit-learn/compare/master...lbillingham:issue_4441 

hopefully line endings are now actually posix-y.
Closes #4441.

unknown and others added 4 commits June 4, 2015 11:30
…m with a tail -> 0.0 when using `pca(n_components='mle')`
…values and spectrum with a tail -> 0.0 when using in PCA('mle')

I've run into log(0) errors on data from the wild.
I haven't been able to construct a synthetic pathological X. Since _assess_dimension_ and _infer_dimension_ are imported in the tests, I just test using a pathological spectrum.
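For illustration, here is a minimal sketch (values are made up, not from the PR) of the kind of pathological spectrum being described: eigenvalues whose tail is exactly zero make the tail-variance term `v` vanish, and `np.log(v)` then yields `-inf`.

```python
import numpy as np

# Hypothetical pathological spectrum: eigenvalues with an exactly-zero tail,
# the kind of input that breaks PCA(n_components='mle').
spectrum = np.array([2.5, 1.2, 0.5, 0.0, 0.0, 0.0])

rank = 3
n_features = len(spectrum)

# The tail variance as computed inside _assess_dimension_:
v = np.sum(spectrum[rank:]) / (n_features - rank)
print(v)  # 0.0

# np.log(0.0) is -inf, which then propagates through the log-likelihood.
with np.errstate(divide='ignore'):
    print(np.log(v))  # -inf
```

Passing such a spectrum straight to the helper sidesteps the need for a full pathological X.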
@amueller
Member

amueller commented Jun 5, 2015

looks much better. Btw, you can force-push to old branches to avoid opening new PRs. But you can also just close the other one.

@amueller
Member

amueller commented Jun 5, 2015

It would be great if, in addition, we had a test with a full dataset. If you have a perfectly correlated one, shouldn't that give you zeros? I think make_classification or make_low_rank_matrix should give you one of these.
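As a sketch of that suggestion (the parameter values here are illustrative, not from the PR), make_low_rank_matrix with a zero tail_strength produces data whose trailing singular values are numerically negligible:

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix

# Illustrative parameters: effective rank far below n_features, no noisy tail.
X = make_low_rank_matrix(n_samples=100, n_features=20,
                         effective_rank=2, tail_strength=0.0,
                         random_state=0)

# The trailing singular values sit at the numerical noise floor.
_, s, _ = np.linalg.svd(X, full_matrices=False)
ratio = s[-1] / s[0]
print(ratio < 1e-6)  # True: the spectrum tail is effectively zero
```

Whether the tail ends up *exactly* zero, rather than merely tiny, depends on floating-point round-off in the SVD.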

@lbillingham
Author

I'll take a look at better testing. Will probably not get to it for a couple of days, though.
Thanks for the info about pushing to update an open PR.

Sent from my iFern


@@ -23,7 +24,7 @@


def _assess_dimension_(spectrum, rank, n_samples, n_features):
"""Compute the likelihood of a rank ``rank`` dataset
"""Compute the likelihood of a rank ``rank`` dataset.
Member

Compute the likelihood that the dataset has the given rank?

…all values rather than rely on , fixed function name underscoring
 into issue_4441

Conflicts:
	sklearn/decomposition/pca.py
@lbillingham
Author

I've tried making a full test using

X, _ = sklearn.datasets.make_classification(n_repeated=17, n_informative=1, n_clusters_per_class=1, n_classes=1)

and

X = sklearn.datasets.make_low_rank_matrix(effective_rank=2, tail_strength=1.0)

which, I think, are about as likely to cause an error as I can make them.

We do run into v = 0 in v = np.sum(spectrum[rank:]) / (n_features - rank), but I have not been able to force a situation where we get a log(0): spectrum_[rank:n_features] = v and for i in range(rank) always conspire to avoid the 0.

If there is an off-by-one error and we should be looping to range(rank + 1) then the datasets above would cause log(0) errors.
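A small sketch of that off-by-one observation (the spectrum values are made up): with a zero tail, range(rank) only ever visits the strictly positive head of the spectrum, while range(rank + 1) would reach the first zero entry.

```python
import numpy as np

# Made-up spectrum with a zero tail.
spectrum = np.array([2.5, 1.2, 0.5, 0.0, 0.0])
rank = 3
n_features = len(spectrum)

v = np.sum(spectrum[rank:]) / (n_features - rank)  # 0.0 for this spectrum
spectrum_ = spectrum.copy()
spectrum_[rank:n_features] = v

# range(rank) touches only indices 0..rank-1, all strictly positive here,
# so a log over those entries never sees the zero tail.
head_ok = all(spectrum_[i] > 0 for i in range(rank))
print(head_ok)  # True

# range(rank + 1) would include index `rank`, which is 0.0 after the fill.
print(spectrum_[rank])  # 0.0
```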

I've changed the test spectrum to be as suggested and removed the trailing _ from the function names.

I also noticed that

pv = -np.log(v) * n_samples * (n_features - rank) / 2.

was susceptible to log(0), so I've moved the if rcond > abs(v): check before it.
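The guard being described can be sketched as follows (the function name and the rcond default are mine, not the PR's actual code): check the tolerance before taking the log, and return -inf for a numerically zero tail variance instead of evaluating log(0).

```python
import numpy as np

def tail_log_likelihood(v, n_samples, n_features, rank, rcond=1e-15):
    # Bail out before taking the log of a tail variance that is
    # numerically zero, instead of computing -np.log(0.0) * ... .
    if rcond > abs(v):
        return -np.inf
    return -np.log(v) * n_samples * (n_features - rank) / 2.

print(tail_log_likelihood(0.0, n_samples=100, n_features=20, rank=3))  # -inf
print(np.isfinite(tail_log_likelihood(0.5, n_samples=100, n_features=20, rank=3)))  # True
```

Returning -inf is consistent with the interpretation that a rank whose residual variance collapses to zero has vanishing likelihood under the MLE model.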

@lbillingham lbillingham closed this Jun 8, 2015
@lbillingham
Author

I didn't mean to close this, sorry.
Hit the wrong button.

@lbillingham lbillingham reopened this Jun 8, 2015
@lbillingham
Author

sorry, looks like I merged something wrong

@amueller amueller changed the title Issue 4441 [WIP] PCA n_components='mlp' instability. Issue 4441 Jun 8, 2015
@amueller
Member

amueller commented Jun 8, 2015

can you please add the test with a synthetic X and fixed random state?

@lbillingham
Author

Sure: I'll get to it in ~12 hours or so. I don't think I can find a way to exercise the protective code paths I created using the output of make_low_rank_matrix or make_classification.

@amueller
Member

amueller commented Jun 8, 2015

hum ok then. I'll try to have a look later, but maybe someone else finds time?

@@ -75,6 +80,8 @@ def _assess_dimension_(spectrum, rank, n_samples, n_features):
spectrum_ = spectrum.copy()
spectrum_[rank:n_features] = v
for i in range(rank):
if spectrum_[i] < rcond:
Member

Can spectrum_ values ever be negative here? If not, then the check above doesn't need the abs either, right?

Author

Spectrum comes from S**2 with _, S, _ = scipy.linalg.svd(X), so I think it should be non-negative. [We'll never see complex numbers, will we?]

I'll have a go at constructing an X that gets trapped by the new code paths for a full test, but I have not been able to so far.
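A quick check of that non-negativity claim (random data; the shapes are arbitrary):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(30, 5)

# The spectrum handed to the MLE helper is built from squared singular
# values, so for real-valued X it is non-negative by construction.
_, S, _ = linalg.svd(X)
spectrum = S ** 2
print(bool((spectrum >= 0).all()))  # True
```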


Member

we'll never see complex numbers will we?

PCA could be one of the algorithms that might work with complex numbers. But we don't support that AFAIK.

We can either remove the abs everywhere, or add it also inside the v summation. Couldn't say which is better.

….inf`` codepath in `pca._assess_dimension`. Removed redundant `abs(v)` as `v` is composed by summing squared real numbers.
@lbillingham
Author

@vene you are right about the abs() being surplus to requirements.
I've got an X that goes through the rcond > v test, but have not been able to make one for spectrum_[i] < rcond.

@lbillingham
Author

Anything else I need to do on this?

@amueller
Member

Sorry, we have a lot of things to review and not enough people.

n_redundant=1, n_clusters_per_class=1,
random_state=20150609)
pca = PCA(n_components='mle').fit(X)
assert_equal(pca.n_components_, 0)
Member

That seems really odd to me. Shouldn't it be 1?

Author

Is this related to the off-by-one error that @alexis-mignon talked about in the original comment on 24 Mar?

@lbillingham
Author

No worries, so long as I'm not causing a hold up


@rth
Member

rth commented Jun 8, 2017

@lbillingham Could you please resolve conflicts or rebase? Thanks.

@lbillingham
Author

Update on this: I currently have only a work computer and a tablet, neither of which is up to this kind of work.
I've asked if I can do FOSS work on my own time but on company hardware; hopefully that will be a yes, but it is taking some time.

@rth
Member

rth commented Nov 30, 2017

@lbillingham If you don't have the availability to finish this PR, can it be continued by another contributor? Thanks.

@glemaitre glemaitre added this to the 0.21 milestone Jun 13, 2018
@jnothman jnothman changed the title [WIP] PCA n_components='mlp' instability. Issue 4441 [WIP] PCA n_components='mle' instability. Issue 4441 Apr 11, 2019
@jnothman jnothman removed this from the 0.21 milestone Apr 11, 2019
@yxliang01

yxliang01 commented Apr 15, 2019

Wow... This has been 4 years now... And, mle is still completely unusable to me... 😂

@jnothman
Member

It's yours to complete if you want, @yxliang01!

@jnothman
Member

#10359 is the one to be completed, @yxliang01. Go on, take it over, address the last few comments.

@jnothman
Member

I'm closing this one in preference for #10359.

@jnothman jnothman closed this Apr 15, 2019
@yxliang01

@jnothman Unfortunately, I don't really understand the MLE algorithm, nor sklearn itself. So, at least for now, I am not the right person to do so...

@jnothman
Member

Pity. It would be nice to see this completed, and I think it is not far off.

@yxliang01

@jnothman Yeah...


Successfully merging this pull request may close these issues.

Problems in sklearn.decomposition.PCA with "n_components='mle' option"
7 participants