[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) #8742
Conversation
sklearn/decomposition/pca.py
Outdated
@@ -367,7 +371,10 @@ def _fit(self, X):

        # Handle n_components==None
        if self.n_components is None:
            n_components = X.shape[1]
            if self.svd_solver is not 'arpack':
should use != for string comparison (two string objects may have same text but different id)
Thanks. I didn't understand the exact behaviour of the "is" operator.
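To illustrate the point above, a minimal sketch (not part of the PR) of why identity comparison is unreliable for strings, whereas == / != compare values:

# "is" compares object identity, not text; two equal strings need not be
# the same object, so "is not" can be True even when the text matches.
a = 'arpack'
b = ''.join(['ar', 'pack'])   # same text, but usually a distinct object in CPython
print(a == b)                 # True  -> value comparison
print(a is b)                 # typically False -> identity comparison
print(a != 'arpack')          # False -> the comparison the code actually wants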
for solver in solver_list:
    for n_components in [-1, 3]:
        assert_raises(ValueError,
                      PCA(n_components, svd_solver=solver).fit, X)
        assert_raises_regex(ValueError,
Should check the same for X.T if you want it to be invariant to axis.
I don't think you test the boundary case... if we really need arpack to be different to the others. (I'm not very familiar with this code / solvers.)
X.T: I can add it alright.
Boundary case: the shape of X is (2,3); so for the "full" solver the maximum number of components is 2, while for "arpack" it is 1.
Regarding "I'm not very familiar with this code / solvers", I'm not sure what you mean. I actually just extended the non-regression test that was already present for my fix, so the original test may not be best practice. It looked good and functional to me.
Again, the point is to check that the error is correct for both n_samples < n_features and n_samples > n_features.
pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape)-1)
space around - please
Otherwise, this lgtm if its premise is correct (and I suspect it is, but I'm not sure)
Could you please add an entry to what's new?
pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape) - 1)
Again, if you're only trying on one dataset, min(shape) is fixed, and so this assertion does not properly check whether the implementation is correct.
Thanks jnothman. As I said, I can add it "alright", but I am not sure of the relevance.
This particular part tests this code:
# Handle n_components==None
if self.n_components is None:
if self.svd_solver != 'arpack':
n_components = min(X.shape)
else:
n_components = min(X.shape) - 1
else:
n_components = self.n_components
Not only should the axis make no difference (min(X.shape) is also used in the tested code), but assert_equal() can only pass with the correct n_components, which is all we want.
Do tell me if I'm still missing something.
The point is more that this is a non-regression test, and we want to make sure the code works with n_samples < n_features and n_samples > n_features.
EDIT: Forget all that's written underneath. I was young and clueless^^. @jnothman "Could you please add an entry to what's new?" You mean, you want me to summarize the main changes brought by this PR? Originally, it was just:
THEN I also realised (I wrote this in my original PR, not the Issue page) while running the tests that (3) pca.py failed to handle n_components==None if the solver is 'arpack', as it did not take min(n_samples, n_features) MINUS 1. (Was that what you wanted?)
Anyone, please kindly update me on what work is needed here - when you can. Do consider the original PR #8486 if you want to see the other reviews made before. @jnothman @agramfort @glemaitre EDIT on 6 Aug: Never mind. I didn't know at the time about the whats_new file, so I didn't get the last thing I was told to do. This obviously is all finished, so it's just a matter of whether we add it to 0.19 or a subsequent release.
I don't think this is high priority for the upcoming release, so if we don't give it enough attention soon, please ping in a couple of weeks.
Apologies, I didn't get what you meant by "Could you please add an entry to what's new?" at the time.
sklearn/decomposition/pca.py
Outdated
n_components cannot be equal to n_features for svd_solver == 'arpack'.
explained is greater than the percentage specified by n_components.
If svd_solver == 'arpack', the number of components must be strictly
less than the minimum of n_features and n_samples:
The equation says "equal", the text says "strictly less". I'm confused why it should be strictly less, but it looks like that is what was implemented before.
if self.svd_solver != 'arpack':
    n_components = min(X.shape)
else:
    n_components = min(X.shape) - 1
What are the exact conditions on the arpack solver?
the solver uses scipy.sparse.linalg.svds, and according to the docs the k parameter (n_components) must be such that 1 <= k < min(A.shape).
hm then this seems fine. alright.
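For reference, a small sketch (not from the PR) of the scipy constraint mentioned above, 1 <= k < min(A.shape):

import numpy as np
from scipy.sparse.linalg import svds

A = np.random.RandomState(0).rand(5, 3)   # min(A.shape) == 3

u, s, vt = svds(A, k=2)   # valid: 1 <= 2 < 3
# svds(A, k=3)            # invalid: k must be strictly smaller than min(A.shape)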
pca = PCA(svd_solver=solver)
pca.fit(X)
if solver == 'arpack':
    assert_equal(pca.n_components_, min(X.shape) - 1)
The point is more that this is a non-regression test, and we want to make sure the code works with n_samples < n_features and n_samples > n_features.
for solver in solver_list:
    for n_components in [-1, 3]:
        assert_raises(ValueError,
                      PCA(n_components, svd_solver=solver).fit, X)
        assert_raises_regex(ValueError,
Again, the point is to check that the error is correct for both n_samples < n_features and n_samples > n_features.
doc/whats_new.rst
Outdated
@@ -51,6 +51,11 @@ Decomposition, manifold learning and clustering
   division on Python 2 versions. :issue:`9492` by
   :user:`James Bourbeau <jrbourbeau>`.

- In :class:`decomposition.pca` selecting a n_components parameter greater than
I would either phrase both changes in terms of what the problem was or what the fix was. Right now you describe the error for the first change and the fix for the second.
should be .PCA, not .pca
assert_equal(pca.n_components_, min(X.shape))

# We conduct the same test on X.T so that it is invariant to axis.
X_2 = X.T
can you do a for-loop instead of repeating the code please?
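A rough sketch of how the duplicated block could be folded into one loop over both orientations (solver_list and the data below are placeholders of mine, not the actual test code):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(5, 3)
solver_list = ['full', 'arpack', 'randomized', 'auto']

# Run the same n_components==None check on X and X.T so it is invariant to axis.
for data in [X, X.T]:
    for solver in solver_list:
        pca = PCA(svd_solver=solver).fit(data)
        if solver == 'arpack':
            assert pca.n_components_ == min(data.shape) - 1
        else:
            assert pca.n_components_ == min(data.shape)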
Yes, sorry, I was quite careless here. Also, I'm thinking of moving the whats_new entry to Miscellaneous instead of Bug fix?
either is fine. You haven't tested the arpack special case, right?
Otherwise LGTM
doc/whats_new.rst
Outdated
@@ -51,6 +51,11 @@ Decomposition, manifold learning and clustering
   division on Python 2 versions. :issue:`9492` by
   :user:`James Bourbeau <jrbourbeau>`.

- In :class:`decomposition.pca` selecting a n_components parameter greater than
should be .PCA, not .pca
I expect the checks to fail, since after some refactoring my test failed locally. No idea why, but pushing this here in case someone spots it before me.
It's being understood as a regex: those () are not literals.
…On 19 August 2017 at 06:21, wallygauze ***@***.***> wrote:
I expect the checks to fail, since after some refactoring my test failed locally. No idea why, but pushing this here in case someone spots it before me.
I have
AssertionError: "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'" does not match "n_components=-1 must be between 0 and min(n_samples, n_features)=2 with svd_solver='full'"
but the strings were identical when I compared them.
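For anyone following along, a minimal sketch (mine, not from the PR) of the regex pitfall being pointed out: the expected-message argument of assert_raises_regex is treated as a regular expression, so the parentheses in the message need escaping, e.g. with re.escape:

import re

msg = ("n_components=-1 must be between 0 and "
       "min(n_samples, n_features)=2 with svd_solver='full'")

# Used directly as a pattern, '(' and ')' act as group delimiters rather than
# literal characters, so the pattern fails to match the very text it was copied from:
assert re.search(msg, msg) is None

# Escaping the pattern restores a literal match:
assert re.search(re.escape(msg), msg) is not None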
ping @jnothman
@amueller do you want to complete your review here?
A little bit of help to fix the conflict:
lgtm
whats_new conflict solved, all set and ready to be merged
Thanks @wallygauze for your contributions and your patience!
- remove outdated comment
- fix also for FeatureUnion
- [MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) (scikit-learn#8742)
- [MRG+1] Remove hard dependency on nose (scikit-learn#9670)
- MAINT Stop vendoring sphinx-gallery (scikit-learn#9403)
- CI upgrade travis to run on new numpy release (scikit-learn#9096)
- CI Make it possible to run doctests in .rst files with pytest (scikit-learn#9697)
  * doc/datasets/conftest.py to implement the equivalent of nose fixtures
  * add conftest.py in root folder to ensure that sklearn local folder is used rather than the package in site-packages
  * test doc with pytest in Travis
  * move custom_data_home definition from nose fixture to .rst file
- [MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)
  * Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef
  * matthews_corrcoef: cosmetic change requested by jnothman
  * Add test_matthews_corrcoef_overflow for Bug#9622
  * test_matthews_corrcoef_overflow: clean-up and make deterministic
  * matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype
  * test_matthews_corrcoef_overflow: add simple deterministic tests
- TST Platform independent hash collision tests in FeatureHasher (scikit-learn#9710)
- TST More informative error message in test_preserve_trustworthiness_approximately (scikit-learn#9738)
- add some rudimentary tests for meta-estimators
- fix extra whitespace in error message
- add missing if_delegate_has_method in pipeline
- don't test tuple pipeline for now
- only copy list if not list already? doesn't seem to help?
Reference Issue
Fixes #8484
What does this implement/fix? Explain your changes.
Documentation changes:
- n_components parameter documentation, and mentions elsewhere in the parameters section
- n_components_ attribute documentation
- unrelated (to the issue) extra: corrected documentation for the explained_variance_ratio_ attribute
Any other comments?
Original pull-request: #8486