[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) #8742


Merged: 26 commits merged into scikit-learn:master on Sep 9, 2017

Conversation

wallygauze (Contributor)

Reference Issue

Fixes #8484

What does this implement/fix? Explain your changes.

  • Updated the 'Handle n_components==None' code in the _fit method
  • Extended the relevant ValueError in the _fit_full method to include n_samples as a limit
  • Extended the ValueError raised for not (1 <= n_components <= n_features) and the ValueError raised for (svd_solver == 'arpack' and n_components == n_features) in _fit_truncated (see the sketch after this description)

Documentation changes:

  • n_components parameter documentation, and mentions elsewhere in the parameters section

  • n_components_ attribute documentation

  • unrelated (to issue) extra: corrected documentation for explained_variance_ratio_ attribute

Any other comments?

Original pull-request: #8486
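
For context, a minimal sketch of the kind of bound check this PR introduces. It is illustrative only: the helper name _check_n_components is hypothetical (the real checks live inline in pca.py's _fit_full and _fit_truncated), and the error wording is modelled on the messages quoted later in this thread.

def _check_n_components(n_components, n_samples, n_features, svd_solver):
    # Illustrative sketch: n_components is now bounded by min(n_samples, n_features),
    # not by n_features alone.
    limit = min(n_samples, n_features)
    if svd_solver == 'arpack':
        # arpack (scipy.sparse.linalg.svds) additionally needs a strict inequality.
        if not 1 <= n_components < limit:
            raise ValueError("n_components=%r must be strictly less than "
                             "min(n_samples, n_features)=%r with "
                             "svd_solver='arpack'" % (n_components, limit))
    elif not 0 <= n_components <= limit:
        raise ValueError("n_components=%r must be between 0 and "
                         "min(n_samples, n_features)=%r with "
                         "svd_solver='full'" % (n_components, limit))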

@wallygauze wallygauze changed the title Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Apr 13, 2017
@@ -367,7 +371,10 @@ def _fit(self, X):

# Handle n_components==None
if self.n_components is None:
    n_components = X.shape[1]
    if self.svd_solver is not 'arpack':
Member

Should use != for string comparison (two string objects may have the same text but different ids).

Contributor Author

Thanks. I didn't understand the exact behaviour of the 'is' operator.
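
(Side note: a two-line sketch of the difference. == compares the text of two strings, while is compares object identity, which is not guaranteed for equal strings built at runtime.)

solver = ''.join(['ar', 'pack'])   # a new string object built at runtime
print(solver == 'arpack')          # True: compares the characters
print(solver is 'arpack')          # usually False: compares object identity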

for solver in solver_list:
    for n_components in [-1, 3]:
        assert_raises(ValueError,
                      PCA(n_components, svd_solver=solver).fit, X)
        assert_raises_regex(ValueError,
Member

Should check the same for X.T if you want it to be invariant to axis.

Member

I don't think you test the boundary case... if we really need arpack to be different to the others. (I'm not very familiar with this code / solvers.)

Contributor Author (@wallygauze, May 27, 2017)

X.T: I can add it alright.

Boundary case: the shape of X is (2,3); so for the "full" solver the maximum number of components is 2, while for "arpack" it is 1.
Regarding "I'm not very familiar with this code / solvers", I'm not sure what you mean. I actually just extended the non-regression test that was already present for my fix, so the original test may not be best practice. It looked good and functional to me.

Member

Again, the point is to check that the error is correct for both n_samples < n_features and n_samples > n_features.

pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape)-1)
Member

space around - please

jnothman (Member) left a comment

Otherwise, this lgtm if its premise is correct (and I suspect it is, but I'm not sure)

Could you please add an entry to what's new?

pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape) - 1)
Member

Again, if you're only trying it on one dataset, min(X.shape) is fixed, so this assertion does not really check whether the implementation is correct.

Contributor Author

Thanks jnothman. As I said, I can add it alright, but I am not sure of the relevance.
This particular part tests this code:

# Handle n_components==None

if self.n_components is None:
    if self.svd_solver != 'arpack':
        n_components = min(X.shape)
    else:
        n_components = min(X.shape) - 1
else:
    n_components = self.n_components

Not only should the axis make no difference (min(X.shape) is also used in the tested code), but assert_equal() can only pass if n_components_ is correct, which is all we want to check.

Do tell me if I'm still missing something.

Member

The point is more that this is a non-regression test, and we want to make sure the code works with n_samples < n_features and n_samples > n_features.

wallygauze (Contributor Author) commented May 29, 2017

EDIT: Forget all that's written underneath. I was young and clueless^^.

@jnothman "Could you please add an entry to what's new?"

You mean, you want me to summarize the main changes brought by this PR?

Originally, it was just:

  • (1) fixing the fact that inputting n_components>n_features in PCA() raises an error, but not n_components>n_samples,
  • plus the fact that (2) with n_components==None, the n_components_ attribute is taken from n_features instead of min(n_samples, n_features).

THEN I also realised (I wrote this in my original PR, not the Issue page) while running the tests that (3) pca.py failed to handle n_components==None if the solver is 'arpack', as it did not take min(n_samples, n_features) minus 1 (see the sketch below).

(was that what you wanted?)
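
A minimal sketch of the n_components=None behaviour after these fixes (illustrative only; the data shape and solver list are arbitrary, and it assumes a scikit-learn version that includes this PR):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(2, 3)   # n_samples=2 < n_features=3

for solver in ['full', 'randomized', 'arpack']:
    pca = PCA(n_components=None, svd_solver=solver).fit(X)
    if solver == 'arpack':
        # arpack needs n_components strictly below min(n_samples, n_features)
        assert pca.n_components_ == min(X.shape) - 1
    else:
        assert pca.n_components_ == min(X.shape)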

wallygauze (Contributor Author) commented Jul 3, 2017

Anyone, please kindly update me on what work is needed here, when you can. Do consider the original PR #8486 if you want to see the other reviews made before.

@jnothman @agramfort @glemaitre

EDIT on 6 Aug: Never mind. I didn't know at the time about the whats_new file, so I didn't get the last thing I was told to do. This obviously is all finished, so it's just a matter of whether we add it to 0.19 or a subsequent release.

jnothman (Member) commented Jul 3, 2017 via email

wallygauze (Contributor Author)

Apologies, I didn't get what you meant by "Could you please add an entry to what's new?" at the time.
Now that's done, and the work on this PR seems to be pretty much over.

n_components cannot be equal to n_features for svd_solver == 'arpack'.
explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly
less than the minimum of n_features and n_samples:
Member

The equation says "equal" while the text says "strictly less". I'm confused about why it should be strictly less, but it looks like that is what was implemented before.

if self.svd_solver != 'arpack':
    n_components = min(X.shape)
else:
    n_components = min(X.shape) - 1
Member

What are the exact conditions on the arpack solver?

Contributor Author

The solver uses scipy.sparse.linalg.svds, and according to its docs the k parameter (n_components) must satisfy 1 <= k < min(A.shape).
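
A tiny sketch of that constraint in isolation (this is standard scipy behaviour, not something added by this PR; the exact exception message can vary across scipy versions):

import numpy as np
from scipy.sparse.linalg import svds

A = np.random.RandomState(0).randn(2, 3)

u, s, vt = svds(A, k=1)   # fine: 1 <= k < min(A.shape) == 2
try:
    svds(A, k=2)          # k == min(A.shape) is rejected
except ValueError as exc:
    print(exc)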

Member

hm then this seems fine. alright.

@@ -51,6 +51,11 @@ Decomposition, manifold learning and clustering
division on Python 2 versions. :issue:`9492` by
:user:`James Bourbeau <jrbourbeau>`.

- In :class:`decomposition.pca` selecting a n_components parameter greater than
Member

I would either phrase both changes in terms of what the problem was or what the fix was. Right now you describe the error for the first change and the fix for the second.

Member

should be .PCA, not .pca

assert_equal(pca.n_components_, min(X.shape))

# We conduct the same test on X.T so that it is invariant to axis.
X_2 = X.T
Member

can you do a for-loop instead of repeating the code please?
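
For instance, a sketch of what that for-loop could look like (illustrative; X and solver_list refer to the names used in the test snippets quoted above):

for X_i in [X, X.T]:                 # covers n_samples < n_features and n_samples > n_features
    for solver in solver_list:
        pca = PCA(svd_solver=solver)
        pca.fit(X_i)
        if solver == 'arpack':
            assert_equal(pca.n_components_, min(X_i.shape) - 1)
        else:
            assert_equal(pca.n_components_, min(X_i.shape))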

wallygauze (Contributor Author) commented Aug 15, 2017

Yes, sorry, I was quite careless here. Also, I'm thinking of moving the whats_new entry to Miscellaneous instead of Bug fix?

amueller (Member)

either is fine. You haven't tested the arpack special case, right?

jnothman (Member) left a comment

Otherwise LGTM

@jnothman jnothman changed the title [MRG] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG+1] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Aug 17, 2017
wallygauze (Contributor Author)

I expect the checks to fail, since after some refactoring my test failed locally.

No idea why, but pushing this here in case someone spots it before me.
I have
AssertionError: "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'" does not match "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'"
but the strings were identical when I compared them.
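
(One plausible cause, offered as a guess rather than something confirmed in this thread: assert_raises_regex treats the expected string as a regular expression, so the parentheses in "min(n_samples, n_features)" are parsed as a regex group instead of literal characters. A quick sketch:)

import re

msg = ("n_components=-1 must be between 0 and "
       "min(n_samples, n_features)=2 with svd_solver='full'")

print(bool(re.search(msg, msg)))             # False: '(...)' is read as a group, not literal parentheses
print(bool(re.search(re.escape(msg), msg)))  # True: escaping makes the parentheses match literally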

jnothman (Member) commented Aug 19, 2017 via email

wallygauze (Contributor Author)

ping @jnothman

jnothman (Member) commented Sep 7, 2017

@amueller do you want to complete your review here?

lesteve (Member) commented Sep 8, 2017

A little bit of help to fix the conflict:

  • add your entry in doc/whats_new/v0.20.rst
  • revert your changes in doc/whats_new.rst

@amueller amueller changed the title [MRG+1] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) [MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) Sep 8, 2017
amueller (Member) commented Sep 8, 2017

lgtm

wallygauze (Contributor Author)

whats_new conflict solved, all set and ready to be merged

@jnothman jnothman merged commit 787d5d2 into scikit-learn:master Sep 9, 2017
jnothman (Member) commented Sep 9, 2017

Thanks @wallygauze for your contributions and your patience!

Successfully merging this pull request may close these issues.

n_components in PCA explicitly limited by n_features only