Skip to content

Conversation

wallygauze
Copy link
Contributor

@wallygauze wallygauze commented Jul 8, 2017

Reference Issue

#6452

What does this implement/fix? Explain your changes.

Raises an error when the number of samples of the dataset entered in partial_fit is lower than the number of components

Obviously, the problem mentioned in the reference issue would also present itself when n_components was 'None'. The modified test from the issue resulted in the same failure:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 10, 50
ipca = IncrementalPCA() # leaving n_components as None results in n_components = n_features, and fails

for i in range(5):
    ipca.partial_fit(np.random.rand(n_samples, n_features))

Any other comments?

@wallygauze wallygauze changed the title [WIP] Raising an error when batch_size < n_components. [WIP] Raising an error when batch_size < n_components in IncrementalPCA Jul 8, 2017
@wallygauze
Copy link
Contributor Author

Codecov: "No report found to compare against" ?

@wallygauze
Copy link
Contributor Author

wallygauze commented Jul 14, 2017

Sugar. Reverting the merge I did while trying to solve the codecov issue did not have the effect I expected (some commits by others have been undone)

EDIT: ok, solved.

@wallygauze wallygauze changed the title [WIP] Raising an error when batch_size < n_components in IncrementalPCA [MRG] Raising an error when batch_size < n_components in IncrementalPCA Jul 14, 2017
@amueller amueller added this to the 0.19 milestone Jul 15, 2017
@@ -210,11 +210,19 @@ def partial_fit(self, X, y=None, check_input=True):
self.components_ = None

if self.n_components is None:
self.n_components_ = n_features
if self.components_ is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add regression tests for these? I guess if n_features < n_samples we had an error earlier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the master had an error if n_samples < n_features (you wrote the opposite, but I believe it was a typo right?). As a ‘visual’ aid, this is the partial_fit method, so n_samples is equivalent to the size of the batches used.

@amueller
Copy link
Member

looks good apart of missing regression test for n_components=None and n_features < n_samples

@wallygauze
Copy link
Contributor Author

added the regression test!

@wallygauze
Copy link
Contributor Author

ping @amueller. Should I add an entry to whats_new.rst?

@amueller
Copy link
Member

@wallygauze sorry, too many reviews :-/ yes, whatsnew sounds good :)

@amueller
Copy link
Member

LGTM

@amueller amueller changed the title [MRG] Raising an error when batch_size < n_components in IncrementalPCA [MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA Jul 24, 2017
@wallygauze
Copy link
Contributor Author

Thanks.

elif not 1 <= self.n_components <= n_features:
raise ValueError("n_components=%r invalid for n_features=%d, need "
"more rows than columns for IncrementalPCA "
"processing" % (self.n_components, n_features))
elif not self.n_components <= n_samples:
raise ValueError("n_components=%r must be less or equal to "
"the batch number of samples %d. You can change "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the "You can change either one depending on what you want". It doesn't bring any useful piece of information and the message is clear without it.

Copy link
Contributor Author

@wallygauze wallygauze Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that so. I added that to avoid putting the emphasis on either of the parameters. But you are right that it should still be easy to see even for a beginner.

@wallygauze
Copy link
Contributor Author

@lesteve Done

elif not 1 <= self.n_components <= n_features:
raise ValueError("n_components=%r invalid for n_features=%d, need "
"more rows than columns for IncrementalPCA "
"processing" % (self.n_components, n_features))
elif not self.n_components <= n_samples:
raise ValueError("n_components=%r must be less or equal to "
"the batch number of samples "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough we were chatting with @ogrisel about this yesterday in an unrelated context. IIUC he was hoping that IncrementalPCA would be able to do partial_fit on a small number of samples (and converge to something sensible after a few calls to partial_fit). It looks like this is not the case at the moment ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I bumped into the same problem (checking that n_components <= n_features but not n_components <= n_samples) in sklearn/decomposition/pca.py yesterday. There are also slight inconsistencies between _fit_full and _fit_truncated. Not in this PR but I think we should have a helper function that is reused where appropriate.

Copy link
Contributor Author

@wallygauze wallygauze Aug 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lesteve I have a pull-request for PCA as well --> #8742.
It has received a number of reviews and it all seems pretty much finished, but it has not received much attention these last months because it was not marked for the 0.19 release (which on second thoughts may be incongruous since it's practically the same as this.)

Do you want to have a look, I do think it may be a quick case to just finish off.

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed some minor changes in the test. Other than this this looks good.

if self.components_ is None:
self.n_components_ = min(n_samples, n_features)
else:
self.n_components_ = self.components_.shape[0]
elif not 1 <= self.n_components <= n_features:
raise ValueError("n_components=%r invalid for n_features=%d, need "
"more rows than columns for IncrementalPCA "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what "more rows than columns means here" ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the message is good here either, but I wanted to focus my pull request on the points mentioned in #6452 (so that it would be reviewed and merged more quickly).



def test_n_components_none():
# Ensures that n_components == None is handled correctly
rng = np.random.RandomState(1999)
for n_samples, n_features in [(50, 10), (10, 50)]:

X = rng.rand(n_samples, n_features)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the improvements. For this bit here, the reason why I wanted X to be created anew for each partial_fit call was just to replicate the mechanism of batches (i.e. if you get three batches of 30 samples from a dataset with 90 samples, the three batches will be different).

Not sure it's that important for our case, but regardless I just realised my code was producing identical batches anyway, since the random state is fixed beforehand (I guess I would have had to use a different random state for the second call)

@wallygauze
Copy link
Contributor Author

@lesteve Can we merge or are we expecting a third person to review the changes you yourself added to the test? It lgtm

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than needing the what's new entry to move to 0.20, this looks good to me

"n_components={} invalid for n_features={}, need"
" more rows than columns for IncrementalPCA "
"processing".format(n_components, n_features),
IncrementalPCA(n_components, batch_size=10).fit, X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be raised for partial_fit?

@jnothman
Copy link
Member

Thanks for your contribution and for your patience!

@jnothman jnothman merged commit baa2048 into scikit-learn:master Aug 14, 2017
@wallygauze wallygauze deleted the n_samples6452 branch August 15, 2017 14:44
@wallygauze
Copy link
Contributor Author

:D

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants