[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) #8742


Merged: 26 commits merged into scikit-learn:master on Sep 9, 2017

Conversation

wallygauze (Contributor)

Reference Issue

Fixes #8484

What does this implement/fix? Explain your changes.

  • Updated the 'Handle n_components==None' code in the _fit method
  • Extended the relevant ValueError in the _fit_full method to include n_samples as a limit
  • Extended the ValueError raised for not (1 <= n_components <= n_features) and the ValueError raised for (svd_solver == 'arpack' and n_components == n_features) in _fit_truncated (see the sketch after this description)

Documentation changes:

  • n_components parameter documentation, and mentions elsewhere in the parameters section

  • n_components_ attribute documentation

  • unrelated (to issue) extra: corrected documentation for explained_variance_ratio_ attribute

Any other comments?

Original pull-request: #8486
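
For context, a minimal sketch of the kind of bound check this PR introduces. It is illustrative only: the helper name _check_n_components is hypothetical (the real checks live inline in pca.py's _fit_full and _fit_truncated), and the error wording is modelled on the messages quoted later in this thread.

def _check_n_components(n_components, n_samples, n_features, svd_solver):
    # Illustrative sketch: n_components is now bounded by min(n_samples, n_features),
    # not by n_features alone.
    limit = min(n_samples, n_features)
    if svd_solver == 'arpack':
        # arpack (scipy.sparse.linalg.svds) additionally needs a strict inequality.
        if not 1 <= n_components < limit:
            raise ValueError("n_components=%r must be strictly less than "
                             "min(n_samples, n_features)=%r with "
                             "svd_solver='arpack'" % (n_components, limit))
    elif not 0 <= n_components <= limit:
        raise ValueError("n_components=%r must be between 0 and "
                         "min(n_samples, n_features)=%r with "
                         "svd_solver='full'" % (n_components, limit))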

@wallygauze wallygauze changed the title Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Apr 13, 2017
@@ -367,7 +371,10 @@ def _fit(self, X):

# Handle n_components==None
if self.n_components is None:
    n_components = X.shape[1]
    if self.svd_solver is not 'arpack':
Member

Should use != for string comparison (two string objects may have the same text but different ids).

Contributor Author

Thanks. I didn't understand the exact behaviour of the 'is' operator.
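
(Side note: a two-line sketch of the difference. == compares the text of two strings, while is compares object identity, which is not guaranteed for equal strings built at runtime.)

solver = ''.join(['ar', 'pack'])   # a new string object built at runtime
print(solver == 'arpack')          # True: compares the characters
print(solver is 'arpack')          # usually False: compares object identity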

for solver in solver_list:
    for n_components in [-1, 3]:
        assert_raises(ValueError,
                      PCA(n_components, svd_solver=solver).fit, X)
        assert_raises_regex(ValueError,
Member

Should check the same for X.T if you want it to be invariant to axis.

Member

I don't think you test the boundary case... if we really need arpack to be different to the others. (I'm not very familiar with this code / solvers.)

Contributor Author (@wallygauze, May 27, 2017)

X.T: I can add it alright.

Boundary case: the shape of X is (2,3); so for the "full" solver the maximum number of components is 2, while for "arpack" it is 1.
Regarding "I'm not very familiar with this code / solvers", I'm not sure what you mean. I actually just extended the non-regression test that was already present for my fix, so the original test may not be best practice. It looked good and functional to me.

Member

Again, the point is to check that the error is correct for both n_samples < n_features and n_samples > n_features.

pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape)-1)
Member

space around - please

jnothman (Member) left a comment

Otherwise, this lgtm if its premise is correct (and I suspect it is, but I'm not sure)

Could you please add an entry to what's new?

pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape) - 1)
Member

Again, if you're only trying it on one dataset, min(X.shape) is fixed, so this assertion does not really check whether the implementation is correct.

Contributor Author

Thanks jnothman. As I said, I can add it alright, but I am not sure of the relevance.
This particular part tests this code:

# Handle n_components==None

if self.n_components is None:
    if self.svd_solver != 'arpack':
        n_components = min(X.shape)
    else:
        n_components = min(X.shape) - 1
else:
    n_components = self.n_components

Not only should the axis make no difference (min(X.shape) is also used in the tested code), but assert_equal() can only pass if n_components_ is correct, which is all we want to check.

Do tell me if I'm still missing something.

Member

The point is more that this is a non-regression test, and we want to make sure the code works with n_samples < n_features and n_samples > n_features.

wallygauze (Contributor Author) commented May 29, 2017

EDIT: Forget all that's written underneath. I was young and clueless^^.

@jnothman "Could you please add an entry to what's new?"

You mean, you want me to summarize the main changes brought by this PR?

Originally, it was just:

  • (1) fixing the fact that inputting n_components>n_features in PCA() raises an error, but not n_components>n_samples,
  • plus the fact that (2) with n_components==None, the n_components_ attribute is taken from n_features instead of min(n_samples, n_features).

THEN I also realised (I wrote this in my original PR, not the Issue page) while running the tests that (3) pca.py failed to handle n_components==None if the solver is 'arpack', as it did not take min(n_samples, n_features) minus 1 (see the sketch below).

(was that what you wanted?)
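
A minimal sketch of the n_components=None behaviour after these fixes (illustrative only; the data shape and solver list are arbitrary, and it assumes a scikit-learn version that includes this PR):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(2, 3)   # n_samples=2 < n_features=3

for solver in ['full', 'randomized', 'arpack']:
    pca = PCA(n_components=None, svd_solver=solver).fit(X)
    if solver == 'arpack':
        # arpack needs n_components strictly below min(n_samples, n_features)
        assert pca.n_components_ == min(X.shape) - 1
    else:
        assert pca.n_components_ == min(X.shape)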

wallygauze (Contributor Author) commented Jul 3, 2017

Anyone, please kindly update me on what work is needed here, when you can. Do consider the original PR #8486 if you want to see the other reviews made before.

@jnothman @agramfort @glemaitre

EDIT on 6 Aug: Never mind. I didn't know at the time about the whats_new file, so I didn't get the last thing I was told to do. This obviously is all finished, so it's just a matter of whether we add it to 0.19 or a subsequent release.

jnothman (Member) commented Jul 3, 2017 via email

wallygauze (Contributor Author)

Apologies, I didn't get what you meant by "Could you please add an entry to what's new?" at the time.
Now that's done, and the work on this PR seems to be pretty much over.

n_components cannot be equal to n_features for svd_solver == 'arpack'.
explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly
less than the minimum of n_features and n_samples:
Member

The equation says "equal" while the text says "strictly less". I'm confused about why it should be strictly less, but it looks like that is what was implemented before.

if self.svd_solver != 'arpack':
    n_components = min(X.shape)
else:
    n_components = min(X.shape) - 1
Member

What are the exact conditions on the arpack solver?

Contributor Author

The solver uses scipy.sparse.linalg.svds, and according to its docs the k parameter (n_components) must satisfy 1 <= k < min(A.shape).
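
A tiny sketch of that constraint in isolation (this is standard scipy behaviour, not something added by this PR; the exact exception message can vary across scipy versions):

import numpy as np
from scipy.sparse.linalg import svds

A = np.random.RandomState(0).randn(2, 3)

u, s, vt = svds(A, k=1)   # fine: 1 <= k < min(A.shape) == 2
try:
    svds(A, k=2)          # k == min(A.shape) is rejected
except ValueError as exc:
    print(exc)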

Member

hm then this seems fine. alright.

@@ -51,6 +51,11 @@ Decomposition, manifold learning and clustering
division on Python 2 versions. :issue:`9492` by
:user:`James Bourbeau <jrbourbeau>`.

- In :class:`decomposition.pca` selecting a n_components parameter greater than
Member

I would either phrase both changes in terms of what the problem was or what the fix was. Right now you describe the error for the first change and the fix for the second.

Member

should be .PCA, not .pca

assert_equal(pca.n_components_, min(X.shape))

# We conduct the same test on X.T so that it is invariant to axis.
X_2 = X.T
Member

can you do a for-loop instead of repeating the code please?
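
For instance, a sketch of what that for-loop could look like (illustrative; X and solver_list refer to the names used in the test snippets quoted above):

for X_i in [X, X.T]:                 # covers n_samples < n_features and n_samples > n_features
    for solver in solver_list:
        pca = PCA(svd_solver=solver)
        pca.fit(X_i)
        if solver == 'arpack':
            assert_equal(pca.n_components_, min(X_i.shape) - 1)
        else:
            assert_equal(pca.n_components_, min(X_i.shape))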

wallygauze (Contributor Author) commented Aug 15, 2017

Yes, sorry, I was quite careless here. Also, I'm thinking of moving the whats_new entry to Miscellaneous instead of Bug fix?

amueller (Member)

either is fine. You haven't tested the arpack special case, right?

jnothman (Member) left a comment

Otherwise LGTM

@jnothman jnothman changed the title [MRG] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG+1] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Aug 17, 2017
wallygauze (Contributor Author)

I expect the checks to fail, since after some refactoring my test failed locally.

No idea why, but pushing this here in case someone spots it before me.
I have
AssertionError: "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'" does not match "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'"
but the strings were identical when I compared them.
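
(One plausible cause, offered as a guess rather than something confirmed in this thread: assert_raises_regex treats the expected string as a regular expression, so the parentheses in "min(n_samples, n_features)" are parsed as a regex group instead of literal characters. A quick sketch:)

import re

msg = ("n_components=-1 must be between 0 and "
       "min(n_samples, n_features)=2 with svd_solver='full'")

print(bool(re.search(msg, msg)))             # False: '(...)' is read as a group, not literal parentheses
print(bool(re.search(re.escape(msg), msg)))  # True: escaping makes the parentheses match literally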

jnothman (Member) commented Aug 19, 2017 via email

wallygauze (Contributor Author)

ping @jnothman

jnothman (Member) commented Sep 7, 2017

@amueller do you want to complete your review here?

lesteve (Member) commented Sep 8, 2017

A little bit of help to fix the conflict:

  • add your entry in doc/whats_new/v0.20.rst
  • revert your changes in doc/whats_new.rst

@amueller amueller changed the title [MRG+1] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Sep 8, 2017
amueller (Member) commented Sep 8, 2017

lgtm

wallygauze (Contributor Author)

whats_new conflict solved, all set and ready to be merged

@jnothman jnothman merged commit 787d5d2 into scikit-learn:master Sep 9, 2017
jnothman (Member) commented Sep 9, 2017

Thanks @wallygauze for your contributions and your patience!

Successfully merging this pull request may close these issues.

n_components in PCA explicitly limited by n_features only