-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA #9303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov: "No report found to compare against" ? |
…into n_samples6452
Sugar. Reverting the merge I did while trying to solve the codecov issue did not have the effect I expected (some commits by others have been undone) EDIT: ok, solved. |
@@ -210,11 +210,19 @@ def partial_fit(self, X, y=None, check_input=True): | |||
self.components_ = None | |||
|
|||
if self.n_components is None: | |||
self.n_components_ = n_features | |||
if self.components_ is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add regression tests for these? I guess if n_features < n_samples
we had an error earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the master had an error if n_samples < n_features (you wrote the opposite, but I believe it was a typo right?). As a ‘visual’ aid, this is the partial_fit method, so n_samples is equivalent to the size of the batches used.
looks good apart of missing regression test for n_components=None and |
added the regression test! |
ping @amueller. Should I add an entry to whats_new.rst? |
@wallygauze sorry, too many reviews :-/ yes, whatsnew sounds good :) |
LGTM |
Thanks. |
elif not 1 <= self.n_components <= n_features: | ||
raise ValueError("n_components=%r invalid for n_features=%d, need " | ||
"more rows than columns for IncrementalPCA " | ||
"processing" % (self.n_components, n_features)) | ||
elif not self.n_components <= n_samples: | ||
raise ValueError("n_components=%r must be less or equal to " | ||
"the batch number of samples %d. You can change " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the "You can change either one depending on what you want". It doesn't bring any useful piece of information and the message is clear without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that so. I added that to avoid putting the emphasis on either of the parameters. But you are right that it should still be easy to see even for a beginner.
@lesteve Done |
elif not 1 <= self.n_components <= n_features: | ||
raise ValueError("n_components=%r invalid for n_features=%d, need " | ||
"more rows than columns for IncrementalPCA " | ||
"processing" % (self.n_components, n_features)) | ||
elif not self.n_components <= n_samples: | ||
raise ValueError("n_components=%r must be less or equal to " | ||
"the batch number of samples " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funnily enough we were chatting with @ogrisel about this yesterday in an unrelated context. IIUC he was hoping that IncrementalPCA
would be able to do partial_fit
on a small number of samples (and converge to something sensible after a few calls to partial_fit). It looks like this is not the case at the moment ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I bumped into the same problem (checking that n_components <= n_features but not n_components <= n_samples) in sklearn/decomposition/pca.py yesterday. There are also slight inconsistencies between _fit_full
and _fit_truncated
. Not in this PR but I think we should have a helper function that is reused where appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lesteve I have a pull-request for PCA as well --> #8742.
It has received a number of reviews and it all seems pretty much finished, but it has not received much attention these last months because it was not marked for the 0.19 release (which on second thoughts may be incongruous since it's practically the same as this.)
Do you want to have a look, I do think it may be a quick case to just finish off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed some minor changes in the test. Other than this this looks good.
if self.components_ is None: | ||
self.n_components_ = min(n_samples, n_features) | ||
else: | ||
self.n_components_ = self.components_.shape[0] | ||
elif not 1 <= self.n_components <= n_features: | ||
raise ValueError("n_components=%r invalid for n_features=%d, need " | ||
"more rows than columns for IncrementalPCA " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what "more rows than columns means here" ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the message is good here either, but I wanted to focus my pull request on the points mentioned in #6452 (so that it would be reviewed and merged more quickly).
|
||
|
||
def test_n_components_none(): | ||
# Ensures that n_components == None is handled correctly | ||
rng = np.random.RandomState(1999) | ||
for n_samples, n_features in [(50, 10), (10, 50)]: | ||
|
||
X = rng.rand(n_samples, n_features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the improvements. For this bit here, the reason why I wanted X to be created anew for each partial_fit call was just to replicate the mechanism of batches (i.e. if you get three batches of 30 samples from a dataset with 90 samples, the three batches will be different).
Not sure it's that important for our case, but regardless I just realised my code was producing identical batches anyway, since the random state is fixed beforehand (I guess I would have had to use a different random state for the second call)
@lesteve Can we merge or are we expecting a third person to review the changes you yourself added to the test? It lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than needing the what's new entry to move to 0.20, this looks good to me
"n_components={} invalid for n_features={}, need" | ||
" more rows than columns for IncrementalPCA " | ||
"processing".format(n_components, n_features), | ||
IncrementalPCA(n_components, batch_size=10).fit, X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also be raised for partial_fit?
Thanks for your contribution and for your patience! |
:D |
Reference Issue
#6452
What does this implement/fix? Explain your changes.
Raises an error when the number of samples of the dataset entered in partial_fit is lower than the number of components
Obviously, the problem mentioned in the reference issue would also present itself when n_components was 'None'. The modified test from the issue resulted in the same failure:
Any other comments?