[MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA #9303

wallygauze · 2017-07-08T22:36:16Z

Reference Issue

What does this implement/fix? Explain your changes.

Raises an error when the batch passed to partial_fit has fewer samples than n_components.

The problem described in the reference issue also occurs when n_components is None. A modified version of the test from the issue fails in the same way:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 10, 50
ipca = IncrementalPCA() # leaving n_components as None results in n_components = n_features, and fails

for i in range(5):
    ipca.partial_fit(np.random.rand(n_samples, n_features))

Any other comments?

wallygauze · 2017-07-08T23:12:58Z

Codecov: "No report found to compare against" ?


wallygauze · 2017-07-14T16:45:36Z

Sugar. Reverting the merge I did while trying to solve the codecov issue did not have the effect I expected (some commits by others have been undone)

EDIT: ok, solved.

amueller · 2017-07-15T22:59:46Z

sklearn/decomposition/incremental_pca.py

@@ -210,11 +210,19 @@ def partial_fit(self, X, y=None, check_input=True):
            self.components_ = None

        if self.n_components is None:
-            self.n_components_ = n_features
+            if self.components_ is None:


Can you add regression tests for these? I guess if n_features < n_samples we had an error earlier?

Yes, master had an error if n_samples < n_features (you wrote the opposite, but I believe that was a typo, right?). As a 'visual' aid: this is the partial_fit method, so n_samples is the size of the batch being passed in.
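To make the point about batches concrete: each call to partial_fit only sees one batch, so its n_samples is the batch size rather than the size of the whole dataset. The sketch below is illustrative only; `batch_slices` is a hypothetical helper written for this example, not a scikit-learn function, and the numbers are made up.

```python
def batch_slices(n_total, batch_size):
    """Yield slices covering range(n_total) in chunks of batch_size.

    Hypothetical helper for illustration only.
    """
    for start in range(0, n_total, batch_size):
        yield slice(start, min(start + batch_size, n_total))


# 95 rows split into batches of 30: the final batch holds only 5 rows.
sizes = [s.stop - s.start for s in batch_slices(95, 30)]

# With n_components = 10, any batch smaller than 10 rows would be rejected
# by a check of the form n_components <= n_samples.
n_components = 10
too_small = [size for size in sizes if size < n_components]
```

Here `sizes` lists the n_samples value that each partial_fit call would see, and `too_small` collects the batches that could not support the requested number of components.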

amueller · 2017-07-15T23:00:38Z

Looks good apart from the missing regression test for n_components=None and n_features < n_samples.


wallygauze · 2017-07-17T13:49:12Z

added the regression test!

wallygauze · 2017-07-23T20:23:02Z

ping @amueller. Should I add an entry to whats_new.rst?

amueller · 2017-07-24T17:12:33Z

@wallygauze sorry, too many reviews :-/ yes, whatsnew sounds good :)

amueller · 2017-07-24T17:15:03Z

LGTM

wallygauze · 2017-07-25T04:36:52Z

Thanks.

lesteve · 2017-07-25T07:12:13Z

sklearn/decomposition/incremental_pca.py

        elif not 1 <= self.n_components <= n_features:
            raise ValueError("n_components=%r invalid for n_features=%d, need "
                             "more rows than columns for IncrementalPCA "
                             "processing" % (self.n_components, n_features))
+        elif not self.n_components <= n_samples:
+            raise ValueError("n_components=%r must be less or equal to "
+                             "the batch number of samples %d. You can change "


I would remove the "You can change either one depending on what you want". It doesn't bring any useful piece of information and the message is clear without it.

Is that so. I added that to avoid putting the emphasis on either of the parameters. But you are right that it should still be easy to see even for a beginner.

wallygauze · 2017-07-25T13:56:18Z

@lesteve Done

lesteve · 2017-08-03T09:22:48Z

sklearn/decomposition/incremental_pca.py

        elif not 1 <= self.n_components <= n_features:
            raise ValueError("n_components=%r invalid for n_features=%d, need "
                             "more rows than columns for IncrementalPCA "
                             "processing" % (self.n_components, n_features))
+        elif not self.n_components <= n_samples:
+            raise ValueError("n_components=%r must be less or equal to "
+                             "the batch number of samples "


Funnily enough we were chatting with @ogrisel about this yesterday in an unrelated context. IIUC he was hoping that IncrementalPCA would be able to do partial_fit on a small number of samples (and converge to something sensible after a few calls to partial_fit). It looks like this is not the case at the moment ...

Also I bumped into the same problem (checking that n_components <= n_features but not n_components <= n_samples) in sklearn/decomposition/pca.py yesterday. There are also slight inconsistencies between _fit_full and _fit_truncated. Not in this PR but I think we should have a helper function that is reused where appropriate.

@lesteve I have a pull-request for PCA as well --> #8742.
It has received a number of reviews and it all seems pretty much finished, but it has not received much attention these last months because it was not marked for the 0.19 release (which on second thoughts may be incongruous since it's practically the same as this.)

Do you want to have a look? I think it may be a quick one to finish off.

lesteve

I pushed some minor changes to the test. Other than that, this looks good.

lesteve · 2017-08-03T09:56:30Z

sklearn/decomposition/incremental_pca.py

+            if self.components_ is None:
+                self.n_components_ = min(n_samples, n_features)
+            else:
+                self.n_components_ = self.components_.shape[0]
        elif not 1 <= self.n_components <= n_features:
            raise ValueError("n_components=%r invalid for n_features=%d, need "
                             "more rows than columns for IncrementalPCA "


I have no idea what "more rows than columns" means here ...

I don't think the message is good here either, but I wanted to focus my pull request on the points mentioned in #6452 (so that it would be reviewed and merged more quickly).
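The branch logic in the hunks above can be sketched as a standalone function. This is a minimal sketch, not the scikit-learn implementation: `resolve_n_components` and its `fitted_components` argument (standing in for `self.components_.shape[0]`) are hypothetical names, the final `return n_components` is an assumption about the part of the method that the diff does not show, and the error messages are shortened paraphrases of the ones in the diff.

```python
def resolve_n_components(n_components, n_samples, n_features,
                         fitted_components=None):
    """Return the effective number of components for one partial_fit batch.

    Hypothetical helper mirroring the checks discussed in this PR.
    """
    if n_components is None:
        if fitted_components is None:
            # First batch: bounded by both dimensions of the batch.
            return min(n_samples, n_features)
        # Later batches: keep the number fixed by the first batch.
        return fitted_components
    if not 1 <= n_components <= n_features:
        raise ValueError("n_components=%r invalid for n_features=%d"
                         % (n_components, n_features))
    if not n_components <= n_samples:
        raise ValueError("n_components=%r must be less or equal to the "
                         "batch number of samples %d."
                         % (n_components, n_samples))
    # Assumption: a valid explicit value is used unchanged.
    return n_components
```

With a batch of shape (10, 50) and n_components=None, this yields 10 rather than 50, which is the case the reproduction snippet at the top of the PR exercises.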

wallygauze · 2017-08-03T15:00:08Z

sklearn/decomposition/tests/test_incremental_pca.py



 def test_n_components_none():
    # Ensures that n_components == None is handled correctly
    rng = np.random.RandomState(1999)
    for n_samples, n_features in [(50, 10), (10, 50)]:
-
+        X = rng.rand(n_samples, n_features)


Thanks for all the improvements. For this bit here, the reason why I wanted X to be created anew for each partial_fit call was just to replicate the mechanism of batches (i.e. if you get three batches of 30 samples from a dataset with 90 samples, the three batches will be different).

Not sure it's that important for our case, but regardless I just realised my code was producing identical batches anyway, since the random state is fixed beforehand (I guess I would have had to use a different random state for the second call)
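Whether successive batches come out identical depends on how the seeded generator is used. The original test code is not shown in this thread, so the following is only a sketch of the general behaviour, using the standard library's `random.Random` as a stand-in for `np.random.RandomState`; `draw_batch` is a hypothetical helper.

```python
import random


def draw_batch(rng, n_samples, n_features):
    """Draw an n_samples x n_features batch of floats from rng."""
    return [[rng.random() for _ in range(n_features)]
            for _ in range(n_samples)]


# One generator, seeded once: each draw advances its internal state,
# so consecutive batches differ from each other.
shared = random.Random(1999)
batch_a = draw_batch(shared, 3, 2)
batch_b = draw_batch(shared, 3, 2)

# Re-creating the generator with the same seed before each draw replays
# the same sequence, so the batches are identical.
batch_c = draw_batch(random.Random(1999), 3, 2)
batch_d = draw_batch(random.Random(1999), 3, 2)
```

In this sketch `batch_a` and `batch_b` differ, while `batch_c` and `batch_d` are equal; a fixed seed makes the run reproducible but does not by itself make every batch the same.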

wallygauze · 2017-08-06T15:55:24Z

@lesteve Can we merge or are we expecting a third person to review the changes you yourself added to the test? It lgtm

jnothman

Other than needing the what's new entry to move to 0.20, this looks good to me

jnothman · 2017-08-06T22:57:13Z

sklearn/decomposition/tests/test_incremental_pca.py

+                            "n_components={} invalid for n_features={}, need"
+                            " more rows than columns for IncrementalPCA "
+                            "processing".format(n_components, n_features),
+                            IncrementalPCA(n_components, batch_size=10).fit, X)


Should this also be raised for partial_fit?
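One way to see why a check placed in partial_fit also surfaces through fit: in scikit-learn, IncrementalPCA.fit feeds the data to partial_fit one batch at a time. The toy class below is a hypothetical stand-in that reproduces only this delegation pattern; `TinyIncremental`, `raises_value_error`, and the data are invented for illustration and do not perform any PCA.

```python
class TinyIncremental:
    """Hypothetical stand-in for an incremental estimator."""

    def __init__(self, n_components, batch_size):
        self.n_components = n_components
        self.batch_size = batch_size
        self.n_batches_seen_ = 0

    def partial_fit(self, rows):
        n_samples = len(rows)
        # The validation lives here, so every entry point shares it.
        if self.n_components > n_samples:
            raise ValueError(
                "n_components=%r must be less or equal to the batch "
                "number of samples %d." % (self.n_components, n_samples))
        self.n_batches_seen_ += 1
        return self

    def fit(self, rows):
        self.n_batches_seen_ = 0
        # fit only splits the data and delegates to partial_fit.
        for start in range(0, len(rows), self.batch_size):
            self.partial_fit(rows[start:start + self.batch_size])
        return self


def raises_value_error(func, *args):
    """Return True if calling func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


rows = [[0.0, 1.0]] * 20
via_fit = raises_value_error(TinyIncremental(15, 10).fit, rows)
via_partial_fit = raises_value_error(TinyIncremental(15, 10).partial_fit,
                                     rows[:10])
```

Both `via_fit` and `via_partial_fit` are True here, which suggests a regression test could assert the same message on either method.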

jnothman · 2017-08-14T23:36:57Z

Thanks for your contribution and for your patience!

wallygauze · 2017-08-15T14:47:17Z

:D


Wally added 3 commits July 8, 2017 21:03

fixed bug (not tested), writing test

d382bb4

removed lower interval comparison check from fix, more work on test

fcb2768

fix was failing another test, + finished test for fix

d4bd366

wallygauze changed the title ~~[WIP] Raising an error when batch_size < n_components.~~ [WIP] Raising an error when batch_size < n_components in IncrementalPCA Jul 8, 2017

Wally and others added 3 commits July 10, 2017 15:55

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

71c5a73

…into n_samples6452

Revert "Merge branch 'master' of https://github.com/scikit-learn/scik…

2cff58d

…it-learn into n_samples6452" This reverts commit 71c5a73, reversing changes made to d4bd366.

Merge branch 'master' into n_samples6452

624e3dd

wallygauze added 6 commits July 14, 2017 17:51

Correcting side-effects from reverting merge

5b250ce

Correction number 2

c508034

Correction number 3

e6b38e3

Correction number 4

93f7301

Correction number 5

1acfd8b

Last Correction

289a8ac

wallygauze mentioned this pull request Jul 14, 2017

IncrementalPCA.partial_fit gives an uninformative error message and is possibly broken #6452

Closed

wallygauze changed the title ~~[WIP] Raising an error when batch_size < n_components in IncrementalPCA~~ [MRG] Raising an error when batch_size < n_components in IncrementalPCA Jul 14, 2017

amueller mentioned this pull request Jul 15, 2017

Possible bug with Incremental PCA #9330

Closed

amueller added this to the 0.19 milestone Jul 15, 2017

amueller reviewed Jul 15, 2017


Wally added 3 commits July 17, 2017 05:20

added regression tests for n_comp=None case in incremental pca

be5ac2d

Merge branch 'n_samples6452' of https://github.com/wallygauze/scikit-…

090c0f4

…learn into n_samples6452

some lines were never used, turned to code better for coverage

eee25b3

amueller changed the title ~~[MRG] Raising an error when batch_size < n_components in IncrementalPCA~~ [MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA Jul 24, 2017

amueller approved these changes Jul 24, 2017


Update whats_new.rst

46fd392

lesteve reviewed Jul 25, 2017


wallygauze added 2 commits July 25, 2017 10:58

modifying error message (part 1)

a755554

modifying error message part2

522ebe0

lesteve reviewed Aug 3, 2017


Minor improvements in test_pca.py

5bdc0f3

lesteve reviewed Aug 3, 2017


wallygauze commented Aug 3, 2017


jnothman reviewed Aug 6, 2017


Wally and others added 3 commits August 14, 2017 12:56

moved entry to 0.20

d15c601

Merge branch 'master' into n_samples6452

5d989a9

Merge branch 'master' into n_samples6452

41d4613

jnothman merged commit baa2048 into scikit-learn:master Aug 14, 2017

wallygauze deleted the n_samples6452 branch August 15, 2017 14:44

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG + 1] Raising an error when batch_size < n_components in Incremen…

5e53231

…talPCA (scikit-learn#9303)

AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG + 1] Raising an error when batch_size < n_components in Incremen…

73b59fa

…talPCA (scikit-learn#9303)

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG + 1] Raising an error when batch_size < n_components in Incremen…

2e44315

…talPCA (scikit-learn#9303)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG + 1] Raising an error when batch_size < n_components in Incremen…

51ff528

…talPCA (scikit-learn#9303)

lesteve mentioned this pull request Jan 4, 2018

IncrementalPCA - Dimension bug #10400

Closed

lanzagar mentioned this pull request Oct 2, 2018

IncrementalPCA fails if data size % batch size < n_components #12234

Closed
