Skip to content

[MRG+1] Adress decomposition.PCA mle option problem #16224

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Mar 4, 2020
Merged

[MRG+1] Adress decomposition.PCA mle option problem #16224

merged 21 commits into from
Mar 4, 2020

Conversation

lschwetlick
Copy link
Contributor

@lschwetlick lschwetlick commented Jan 25, 2020

Reference Issues/PRs

closes #10359
closes #4441

This PR uses the proposed solutions from PR #10359 to issue #4441 and translates it to the current version of the decomposition package. We adressed 2 of the 3 issues raised by the reviewer:

  • rename rcond to spectrum_cutoff for more intuitive naming
  • make the spectrum cutoff value contigent upon the machine epsilon for the spectrum's data type

We reverted changes relating to the off-by-one error. Maybe someone with more experience in linear algebra can weigh in and we can fix it as a seperate issue. It was discussed here and here.

other comments

Contributing authors were @gelavizh1, @marijavlajic and @lschwetlick at #Wimlds sprint Berlin :)

CC: @adrinjalali @noatamir

@glemaitre
Copy link
Member

@lschwetlick
Copy link
Contributor Author

lschwetlick commented Jan 30, 2020

I'm a bit stuck due to my limited knowledge of the linear algebra behind this. Two questions occur to me:

  1. If _assess_dimension_ is a private function and is only accessed from within (from _infer_dimension_) and that call does not make use of the spectrum_cutoff function argument, do we really need it? Or should we just omit it and define the default (cutoff at the datatype's epsilon) a hard coded fact?

  2. In reference to the test coverage - I need to test whether I get a sensible result when the rank and number of features are equal. I do not know how to test this in a good way
    except to make sure that I am getting a value when that is the case. In the test_assess_dimension_same_n_rank_and_features I just added I check this but maybe there is a more specific sensible test case we could use?

I appreciate any input :)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good

I can confirm that the following new tests fail at master:

  • test_infer_dim_bad_spec
  • test_assess_dimension_same_n_rank_and_features
  • test_assess_dimension_small_eigenvalues
  • test_infer_dim_mle



def test_assess_dimension_same_n_rank_and_features():
# Test that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please finish this line

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v0.23.rst under bug fixes. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

@glemaitre
Copy link
Member

If assess_dimension is a private function and is only accessed from within (from infer_dimension) and that call does not make use of the spectrum_cutoff function argument, do we really need it? Or should we just omit it and define the default (cutoff at the datatype's epsilon) a hard coded fact?

I agree. I would remove the parameter because it is never used elsewhere. So let's internally create the threshold.

In addition, could you rename:

  • _infer_dimension_ -> _infer_dimension
  • _assess_dimension_ -> _assess_dimension

They are internal functions so we can change them.

@lschwetlick
Copy link
Contributor Author

lschwetlick commented Feb 15, 2020

Thanks for the input!

Unfortunately I'm having some trouble writing sensible tests for this and I'm starting to understand what is wrong. While I'm a bit reluctant to get into the off by one error problem, I think it might not be avoidable.

So the way I understand it the variable explained_variance_ (later called spectrum) is a vector of the eigenvalues of the data. In the 'mle' fit, we want to ascertain how many dimensions are relevant using this maths. Essentially we take that vector and for each identified eigenvalue, compute a log likelihood.

The problem, which is probably the same as the off-by-one problem we mentioned previously, is that in _infer_dimension we iterate over range(len(spectrum)) and we call out iterated variable rank here.
This is misleading because rank insinuates meaning the rank of the matrix, where rank=0 would mean a matrix that has no variance. However, in _assess_dimension we use the variable rank as an index into our list of eigenvalues, where the first one (i think) refers to a dimension that extists. Actually rank is being used both as an index and as a semantically individual concept and I'm not sure those two jobs match up.

In my understanding, if we find 3 eigenvalues, we should be looking at the LL for the cases rank=0, rank=1, rank=2, and rank=3. Currently we look at ranks 0,1, and 2. That was also why the code coverage was bad before, because the if rank == n_features is never enters unless I artificially directly a call to _assess_dimension.

Still To Do:

  • add an entry to the change log at doc/whats_new/v0.23.rst
  • check the off by one situation
  • write a test case for a known outcome of mle

@lschwetlick lschwetlick changed the title [MRG] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date) [WIP] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date) Feb 16, 2020
Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this now fixes the small eigenvalue problem, but whenever there is only one component, our n_components_ is 0?

It seems on master:

In [...]:     X, _ = datasets.make_classification(n_informative=1, n_repeated=0, 
   ...:                                         n_redundant=0, n_clusters_per_class=1, 
   ...:                                         random_state=42) 
   ...:     pca = PCA(n_components='mle').fit(X) 

In [...]: pca.n_components_                                                                                                                                                                                                              
Out[...]: 0

That makes me think this PR is kinda complete as is, and the off by one issue can have its own issue+pr.

You could also add a test with a large cut-off to check that the passed value always works.

@lschwetlick
Copy link
Contributor Author

Okay, so then let's work on the off by one error in a separate issue. I left TODOs in the code, where I found the code coverage never reaches the lines on account of this.

@adrinjalali
Copy link
Member

The CI issue is not related, I opened #16545 to figure it out. You can safely ignore it (I think)

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to have this in and work on #16546 on a different PR. Thanks @lschwetlick

@cmarmo
Copy link
Contributor

cmarmo commented Mar 2, 2020

@lschwetlick minimal dependencies have been updated in the meanwhile. Could you please synchronise with upstream? this will update the build and (hopefully) green-lights all the tests. Thanks!

@jnothman
Copy link
Member

jnothman commented Mar 2, 2020

Please change the title from WiP to MRG if this is awaiting review but otherwise ready for merge

@adrinjalali adrinjalali changed the title [WIP] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date) [MRG+1] Adress decomposition.PCA mle option problem Mar 3, 2020
lschwetlick and others added 13 commits March 3, 2020 17:37
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@lschwetlick
Copy link
Contributor Author

Could you please synchronise with upstream?

I rebased the PR onto the current upstream master. I hope that was the right way to synchronise it?

@jnothman
Copy link
Member

jnothman commented Mar 3, 2020

I rebased the PR onto the current upstream master. I hope that was the right way to synchronise it?

Looks good though merging master into your branch is cleaner :)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

@cmarmo
Copy link
Contributor

cmarmo commented Mar 4, 2020

Someone available to merge this twice approved all green PR? :)
Thanks!

@rth
Copy link
Member

rth commented Mar 4, 2020

Thanks @lschwetlick (and @cmarmo for the reminder) !

@rth rth merged commit ea31818 into scikit-learn:master Mar 4, 2020
@jnothman
Copy link
Member

jnothman commented Mar 5, 2020 via email

@lschwetlick lschwetlick deleted the pca_fix_new branch March 13, 2020 10:27
ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems in sklearn.decomposition.PCA with "n_components='mle' option"
6 participants